
Identifying the most critical components and maximizing their availability subject to limited cost in cooling subsystems

Universidade Federal de Pernambuco
posgraduacao@cin.ufpe.br
http://cin.ufpe.br/~posgraduacao

Recife 2019


Identifying the most critical components and maximizing their availability subject to limited cost in cooling subsystems

Dissertation presented to the Graduate Program in Computer Science of the Centro de Informática of the Universidade Federal de Pernambuco as a partial requirement for obtaining the degree of Master in Computer Science.

Concentration Area: Performance Evaluation

Advisor: Prof. Dr. Djamel Sadok

Co-advisor: Prof. Dr. Glauco Gonçalves

Recife 2019


Cataloging at source
Librarian Monick Raquel Silvestre da S. Portes, CRB4-1217

G633i Gomes, Demis Moacir

Identifying the most critical components and maximizing their availability subject to limited cost in cooling subsystems / Demis Moacir Gomes. – 2019.

83 f.: il., fig., tab.

Advisor: Djamel Sadok.

Dissertation (Master's) – Universidade Federal de Pernambuco. CIn, Ciência da Computação, Recife, 2019.

Includes references.

1. Performance evaluation. 2. Availability. 3. Sensitivity analysis. I. Sadok, Djamel (advisor). II. Title.

004.029 CDD (23. ed.) UFPE- MEI 2019-098


Demis Moacir Gomes

"Identifying the most critical components and maximizing their availability subject to limited cost in cooling subsystems"

Master's dissertation presented to the Graduate Program in Computer Science of the Universidade Federal de Pernambuco as a partial requirement for obtaining the title of Master in Computer Science.

Approved on: 07/03/2019.

EXAMINATION BOARD

__________________________________________________ Prof. Dr. Nelson Souto Rosa

Centro de Informática / UFPE

_________________________________________________ Prof. Dr. Gustavo Rau de Almeida Callou

Departamento de Computação / UFRPE

_________________________________________________ Prof. Dr. Djamel Fawzi Hadj Sadok

Centro de Informática / UFPE (Advisor)


I would like to thank my family: my father Moacir, my mother Ozani, and my brother Lucas, who were essential to concluding this step. Though I was not as present as in other moments, I would like to say that I love them very much. I thank my relatives for the support in this journey, especially Ozilene, Israel, Josenildo, Rozineide, and my grandparents Severino and Alzemir.

I also warmly thank my friends, including the professors and staff I met at UFRPE and whose friendship I keep, especially Blenda, Daniel Cândido, Jorge, and Gabriel. To my friends in Abreu e Lima, especially Camilla, Diego, Gabriel, and Ricardo. Finally, to my colleagues Guto, Daniel Rosendo, Leylane, André, Daniel Bezerra, Jairo, Gibson, João Monte, Pedro, Adalberto, and Carolina Cani, whom I met at GPRT, where I stayed for three years.

My special thanks to Prof. Judith and my advisor Prof. Djamel, who supported me at GPRT for three years. I thank Prof. Patricia Endo for her support at all moments, and my co-advisor and friend Prof. Glauco Gonçalves for supporting me in the writing and being present when I needed help.

Finally, I thank Lisandra, my girlfriend. Her support and love were essential to accomplish this objective. I also thank her family, Silvia, Josué, and Lucas, who treat me as part of their family.


Cooling plays an important role in data center (DC) availability, mitigating the overheating of Information Technology (IT) components. Although several works evaluate the performance of the cooling subsystem in a DC, few studies consider the significant relationship between the cooling and IT subsystems. Moreover, a DC provider has limited tools to choose its IT and cooling components so as to obtain a desired availability subject to limited cost. This work provides scalable models (using Stochastic Petri Nets - SPN) to represent a cooling subsystem and to analyze the impact of its failures concerning financial costs and service downtime. This study also identifies the components that most impact DC availability, and proposes a strategy to maximize DC availability with a limited budget. However, the optimization process to maximize availability becomes very costly when using the proposed DC SPN models, due to their solution time, which leads to the application of cheaper, yet efficient, models called surrogate models. In order to apply the most accurate surrogate model to optimization tasks, this work compares three surrogate modeling strategies. In the optimization, based on solutions obtained with the chosen surrogate model, three algorithms are compared to choose the one with the best results. Results show that a more redundant cooling architecture reduces costs by 70%. The analysis of the cooling components identified the chiller as the component with the greatest impact on availability. Among the surrogate models based on the DC model, the Gaussian Process (GP) obtained the most reliable results. Finally, Differential Evolution (DE) had the best results in maximizing the availability of a DC.

Keywords: Availability. Surrogate models. Acquisition costs. Sensitivity analysis.


Cooling plays an important role in the availability of a data center (DC), mitigating the overheating of Information Technology (IT) equipment. Although many works evaluate the performance of the cooling subsystem in a DC, few of them have considered the important relationship between the cooling and IT subsystems. Moreover, a DC provider has limited tools to choose its IT and cooling equipment so as to obtain a desired availability even with limited costs. This work provides scalable models (using Stochastic Petri Nets) to represent a cooling subsystem and analyze the impact of its failures with respect to financial costs and service downtime. The study also identifies the components that most influence DC availability, and proposes a strategy to maximize DC availability with a limited budget. However, the optimization task of maximizing availability becomes extremely costly with the use of the stochastic models, due to their solution time, which leads to the application of less complex but very efficient models, the so-called surrogate models. In order to apply the most accurate surrogate model to optimization tasks, this work compares three surrogate model strategies. Regarding the optimization, another three algorithms are compared using solutions obtained with the chosen surrogate model, in order to evaluate which one yields the best results. The results show that adopting a more redundant cooling architecture reduces costs by about 70%. The analysis of the cooling components identified the chiller as the component that most affected availability. Among the surrogate models based on the DC model, the Gaussian Process (GP) achieved the most reliable results. Finally, Differential Evolution (DE) obtained the best results in maximizing the availability of a DC.

Keywords: Availability. Surrogate models. Acquisition costs. Sensitivity analysis.


Figure 1 – Petri net example
Figure 2 – Components of a cooling subsystem in a chilled water architecture
Figure 3 – Components of an IT DC subsystem (based on (SANTOS et al., 2017))
Figure 4 – A GP regression model example
Figure 5 – A regression tree example. In (a), the prediction function y(x); in (b), the regression tree based on splits
Figure 6 – An RF model example (based on <https://dsc-spidal.github.io/harp/docs/examples/rf/>)
Figure 7 – A GBM regression example. In (a), the prediction function y(x); in (b), the residuals (errors) of each prediction. Images based on <https://www.kaggle.com/grroverpr/gradient-boosting-simplified/notebook>
Figure 8 – Latin Square plan example. LHS extends the concept for an arbitrary number of dimensions
Figure 9 – General Evolutionary Algorithms process. Based on (STREICHERT, 2002)
Figure 10 – GA crossover for D = 5 parameters and pair (X1, X3)
Figure 11 – NSGA procedure example for a population with size N = 8. Based on (DEB et al., 2002)
Figure 12 – DE crossover for D = 5 parameters. DE selects one random parameter from the mutant vector (in this example, the second) in order to differ trial and target vectors. Based on (STORN; PRICE, 1997)
Figure 13 – Cooling subsystem architectures from C1 to C4 (based on (CALLOU et al., 2012))
Figure 14 – Proposed SPN model for the C1 cooling architecture
Figure 15 – SPN model proposed for the C4 cooling architecture
Figure 16 – Methodology for availability calculus
Figure 17 – IT subsystem SPN model
Figure 18 – Linear regression plot for aggregation router MTTF
Figure 19 – Complete DC model with IT and cooling subsystems
Figure 20 – Chiller MTTF SA in C1 architecture
Figure 21 – Cooling tower MTTF SA in C1 architecture
Figure 22 – CRAC MTTF SA in C1 architecture
Figure 23 – Chiller MTTF SA in C3 architecture
Figure 24 – Cooling tower SA in C2 architecture
Figure 25 – CRAC MTTF SA in C4 architecture
Figure 26 – Surrogate model process


Figure 29 – Surrogate models comparison from 100 to 800 - RMSE
Figure 30 – Inputs of optimization algorithms
Figure 31 – Optimization algorithms comparison concerning availability
Figure 32 – Costs of DC for generating the maximum availability values performed by each algorithm
Figure 33 – Availabilities of DC - 40 racks
Figure 34 – Costs of DC - 40 racks
Figure 35 – Availabilities of DC - 45 racks
Figure 36 – Costs of DC - 45 racks
Figure 37 – Availabilities of DC - 50 racks


Table 2 – A simple GBM regression with values referring to the three first iterations
Table 3 – A simple reproduction step with Genetic Algorithm (GA). Based on (GOLDBERG, 1989)
Table 4 – An example of the fast-non-dominated sort algorithm with N = 5
Table 5 – Comparison among this and related work
Table 6 – Guard functions of immediate transitions with respect to the C4 cooling architecture SPN model
Table 7 – Availability calculus of the cooling subsystem from C1 to C4 architectures in (CALLOU et al., 2012) and proposed models
Table 8 – MTTR and MTTF values (CALLOU et al., 2013) used at each transition of the cooling subsystem model
Table 9 – Models comparison
Table 10 – Modified availability formula for Callou's C4 cooling architecture model, considering five active CRACs for system availability
Table 11 – Comparison of the proposed model and the (CALLOU et al., 2012) model using the availability calculus in Table 10
Table 12 – Comparison of solution time of each model
Table 13 – Guard functions with respect to IT components
Table 14 – Default values and semantics of transitions used in the DC model
Table 15 – Value of failure transitions based on the number of CRACs via linear regression
Table 16 – Guard functions related to cooling and IT subsystems
Table 17 – Impact of cooling architectures on the availability and DC downtime cost
Table 18 – Price and energy consumption of all DC components
Table 19 – Acquisition, operational (per year), and total costs in a year in U$
Table 20 – Sensitivity analysis of DC components in each one of the four cooling architectures
Table 21 – Range of MTTF and MTTR values of 20 input parameters
Table 22 – Best tune parameters for each surrogate model (800 points)
Table 23 – Parameters for optimization algorithms
Table 24 – Parameter results derived from optimization algorithms for budgets 600,000, 800,000, and 1M
Table 25 – MTTF results derived from optimization algorithms for budgets 600,000, 800,000, and 1M with 40 racks
Table 26 – MTTF results derived from optimization algorithms for budgets 600,000, 800,000, and 1M with 45 racks


Table 27 – MTTF results derived from optimization algorithms for budgets 600,000, 800,000, and 1M with 50 racks
Table 28 – Summary of the best results found by optimization algorithms with a limited budget


AFCOM Association for Computer Operations Management

AWS Amazon Web Services

CPU Central Processing Unit

CRAC Computer Room Air Conditioner

CTMC Continuous Time Markov Chain

DC Data Center

DE Differential Evolution

GA Genetic Algorithm

GBM Gradient Boosting Machine

GP Gaussian Process

IT Information Technology

IWGCR International Working Group on Cloud Computing Resiliency

LHS Latin Hypercube Sampling

MAE Mean Absolute Error

MTTF Mean Time to Failure

MTTR Mean Time to Repair

NAS Network Attached Storage

NIC Network Interface Card

NSGA-II Non-dominated Sorting Genetic Algorithm II

PUE Power Usage Effectiveness

RAM Random Access Memory

RBD Reliability Block Diagram

RF Random Forest

RMSE Root Mean Squared Error

SA Sensitivity Analysis

SPN Stochastic Petri Net


1 INTRODUCTION
1.1 MOTIVATION
1.2 OBJECTIVES
1.3 ORGANIZATION OF THE WORK
2 BACKGROUND
2.1 PETRI NETS
2.2 DC COMPONENTS
2.3 SENSITIVITY ANALYSIS
2.4 SURROGATE MODELS
2.4.1 Gaussian Process
2.4.2 Random Forest
2.4.3 Gradient Boost Machine
2.5 LATIN HYPERCUBE SAMPLING
2.6 OPTIMIZATION BASED ON EVOLUTIONARY ALGORITHMS
2.6.1 Genetic Algorithm
2.6.2 Non-dominated Sorting Genetic Algorithm
2.6.3 Differential Evolution
2.7 RELATED WORK
2.8 CONCLUDING REMARKS
3 COOLING MODELS
3.1 DESCRIPTION
3.2 VALIDATION
3.3 CONCLUDING REMARKS
4 DC MODEL: METHODOLOGY AND RESULTS
4.1 METHODOLOGY
4.2 IT MODEL
4.3 COUPLING TEMPERATURE VARIATION INTO DC MODEL
4.4 RESULTS
4.4.1 DC availability and costs
4.4.2 Sensitivity Analysis
4.5 CONCLUDING REMARKS
5 ESTIMATING AND MAXIMIZING AVAILABILITY
5.1.2 Results
5.2 OPTIMIZING A DC
5.2.1 Methodology and implementation
5.2.2 Results
5.2.2.1 Maximizing availability without concerns with the number of racks
5.2.2.2 Restricting the number of racks
5.3 CONCLUDING REMARKS
6 CONCLUSION
6.1 LIMITATIONS
6.2 CONTRIBUTIONS
6.3 PUBLICATIONS
6.4 FUTURE WORKS
REFERENCES


1 INTRODUCTION

In a Data Center (DC), unplanned interruptions caused by software, hardware, human error, and cyber attacks (ENDO et al., 2017b) lead to high financial losses and should, therefore, be prevented. According to the International Working Group on Cloud Computing Resiliency (IWGCR) (GAGNAIRE et al., 2014), services such as Amazon Web Services (AWS) and Microsoft Azure suffer a deficit of $336,000 per hour in case of failure. To avoid financial and reputation losses, a cloud provider ought to know its infrastructure limitations to ensure high availability of its hosted services.

A DC can be decomposed into three subsystems: the Information Technology (IT) subsystem, comprising racks, servers, and network equipment; the power subsystem, containing generators, batteries, and switchgear; and the cooling subsystem, with air conditioners and other heat-rejection equipment (BARROSO; CLIDARAS; HÖLZLE, 2013). The cooling subsystem has an essential role in maintaining DC activities while occupying between 5 and 20% of the capital expenditure of the whole DC (KOSIK; GENG, 2015).

The importance of the cooling subsystem could be seen after a failure in Microsoft Azure's DC in Japan in March 2017¹. According to Azure, the interruption occurred due to a cooling component failure, leading the DC management staff to turn off servers to avoid overheating. Therefore, the analysis of the cooling subsystem together with other subsystems is essential to design a reliable DC. There is, therefore, a need to identify its most critical components and, possibly, improve their management.

One way to analyze a cooling subsystem is through measurements. However, this strategy is unfeasible, since it would impact the DC activities, which would have to be suspended for testing, and it makes estimation over a long time period difficult. Hence, this work proposes to assess this subsystem through models using the Stochastic Petri Net (SPN) formalism (MACIEL; LINS; CUNHA, 1996). The model refines previous SPN models (CALLOU et al., 2012)(CALLOU et al., 2013) by adding scalability to faithfully represent scenarios with a higher number of components.

The scalable cooling subsystem model helps to design a DC model in a straightforward manner, through the combination of the IT and cooling subsystems. In a real scenario, the failure of a single air-conditioner unit, called Computer Room Air Conditioner (CRAC), would increase the IT components' temperature and, consequently, their failure probability. In order to model this particular behavior, this work proposes a DC model considering a higher probability of failure in IT components when a CRAC failure occurs. The proposed DC models allow estimating availability, downtime costs, and the most sensitive components among the whole DC equipment. This work also considers acquisition and operation costs that, coupled with downtime costs, allow estimating the total costs of a DC in a year. Therefore, this work helps a DC manager to choose the best cooling architecture for its IT load based on availability and cost, in conjunction with the identification of the most sensitive DC components that need improvement or upgrade.

1 http://www.datacenterknowledge.com/archives/2017/03/31/data-center-cooling-outage-disrupts-azure-cloud-in-japan/

This work also helps a DC provider to plan its data center based on a limited budget, in which the provider may estimate which data center architecture would be better in a particular scenario. Optimization algorithms find optimal solutions based on cost limitations. However, as the optimization process solves complex stochastic models thousands of times, the time to search the solution space makes this process unfeasible. An alternative for solving this problem lies in deriving the stochastic models into a cheaper model with approximate results, called a surrogate model (WANG et al., 2014b). Surrogate models give the estimate (in our case, availability) in O(1) time, i.e., instantly. This allows the usage of optimization algorithms, generating the results that obtain the maximum availability with a limited budget.

1.1 MOTIVATION

A DC must be efficient and reliable. Therefore, a provider aims to keep as many IT components as possible at an operational temperature. Moreover, the identification of the components that most impact availability allows a provider to change maintenance policies and increase equipment redundancy. Then, a DC provider may reduce the costs related to the cooling subsystem while improving its IT equipment, which will guarantee greater profits. One of the metrics used for measuring DC efficiency is the Power Usage Effectiveness (PUE) (KOSIK; GENG, 2015), which divides the power delivered to the DC by the power delivered to the IT equipment. The use of more efficient cooling components would reduce the PUE to values closer to one, where a PUE very close to one indicates a very efficient DC. For example, Google DCs have an average PUE of 1.12².

In order to increase profits, an optimum configuration based on the provider's budget could estimate which DC architecture would be better in a particular scenario. Depending on the optimization strategy, the model needs to run thousands of times. A model that takes a long time to solve would require a long optimization time. As an example, in a computer with an Intel(R) Core(TM) i7-3770 Central Processing Unit (CPU) @ 3.40GHz and 16 GB of Random Access Memory (RAM), a model that runs in 5 minutes and has an optimization task based on evolutionary strategies (STREICHERT, 2002) that requires 10,000 runs would last more than one month considering the solve time alone (10,000 × 5 minutes ≈ 35 days). Thus, using an SPN model may be unfeasible from a DC provider's perspective, since it would wait one month to get an output that could be discarded. An alternative to reduce the solution execution time lies in building a surrogate model based on the outcomes of the original one, which gives the definition of "a model of a model" (WANG et al., 2014b). Applications of surrogate models include geostatistics (KRIGE, 1951), (NAWAR; MOUAZEN, 2017), (JUNIOR, 2018), and computer experiments (SACKS et al., 1989), (BOOKER, 1998), (LIFSHITS, 2012). Reducing the solving time to O(1) will increase the efficiency and experience of the DC provider in performing different optimization tasks on a small time scale. As a result of this study, an optimization strategy based on surrogate models will estimate the essential number of cooling components in order to maximize availability, cool the IT components, and not surpass the limited budget.

1.2 OBJECTIVES

The main objective of this study is to estimate DC availability and costs based on cooling architectures, identifying the most sensitive components and optimizing the DC configuration to find the maximum availability subject to limited costs. To accomplish the main objective, the following specific objectives must be achieved:

• Refine the cooling models from the literature to increase scalability, which facilitates the integration with the IT model;

• Evaluate the DC model (cooling and IT subsystems) considering temperature variation with respect to availability and the costs related to downtime, acquisition, and operation;

• Identify the most sensitive components in order to increase their redundancy or efficiency;

• Build a surrogate model that behaves similarly to the original one, comparing different strategies;

• Maximize the DC availability based on a limited budget, comparing different optimization algorithms.

1.3 ORGANIZATION OF THE WORK

This work, after the introductory chapter, discusses important concepts concerning SPNs, DC components, and sensitivity analysis in chapter 2. Chapter 3 presents the cooling models and their validation. Chapter 4 details the DC models with the addition of cooling components and measures the impact of these components concerning availability and costs regarding downtime, acquisition, and operation of components. Moreover, this chapter presents the sensitivity analysis. Chapter 5 presents the process used to generate the surrogate model, compares the options, and optimizes the DC configuration comparing three algorithms. Finally, chapter 6 concludes this study and provides future works.

(19)

2 BACKGROUND

This chapter discusses the main concepts used in this work. Section 2.1 presents Petri Nets; section 2.2 shows the IT and cooling DC components; section 2.3 depicts how to identify sensitive components using Sensitivity Analysis (SA); section 2.4 presents the surrogate models used in this work; section 2.5 presents Latin Hypercube Sampling (LHS), a technique for selecting non-overlapping inputs for surrogate model training; section 2.6 shows the optimization algorithms based on evolutionary strategies that are used in this work; section 2.7 presents the related works; and section 2.8 concludes this chapter.

2.1 PETRI NETS

Analytical modeling allows measuring system metrics such as availability, reliability, and performability (MACIEL et al., 2012). In the case of availability studies, three modeling forms have mostly been used: Reliability Block Diagram (RBD), Markov Chains, and Petri Nets. Each formalism follows a different modeling strategy, which gives its user a different level of abstraction. The RBD, for example, can be solved using closed-form equations and statistical tools to obtain the general reliability of a system; therefore, it does not model behavioral aspects like run-time decisions. Next, we present more details regarding Petri Nets and their relation with Markov Chains, which offer a more powerful way to model a DC.

Petri Nets are used for performance evaluation through the definition of tokens, transitions, and places. The formal definition based on matrices states a Petri Net as a quintuple $R = (P, T, I, O, K)$, where $P$ is a set of places; $T$ a set of transitions; $I: P \times T \rightarrow \mathbb{N}$ the input preconditions; $O: P \times T \rightarrow \mathbb{N}$ the output postconditions; and $K$ the vector of places' capacities (MACIEL; LINS; CUNHA, 1996). Although there are several classifications of Petri Nets, one of their main applications is generating a Continuous Time Markov Chain (CTMC) (BAUSE; KRITZINGER, 2002). The CTMC is a widely used technique that evaluates performance and dependability characteristics of systems based on continuous distributions (BAIER et al., 2003). However, depending on the system complexity, its respective CTMC can be very complicated and hard to solve.

Therefore, Petri Nets allow modeling systems at a higher abstraction level when compared to CTMCs, decreasing the modeling time of complex systems, as in the case of a DC. The only restriction on using Petri Nets to generate a CTMC lies in the use of timed transitions following exponential distributions, which results in what is known as an SPN. There are many other Petri Net variations, including those adopting non-markovian distributions (DAVID; ALLA, 2010) (MACIEL; LINS; CUNHA, 1996).
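As a minimal illustration of the kind of analysis an SPN ultimately enables, the sketch below (an example added here, not the thesis' tooling; the MTTF and MTTR values are hypothetical) solves the steady-state distribution of a two-state CTMC for one repairable component with numpy, by solving $\pi Q = 0$ together with $\sum \pi = 1$:

```python
import numpy as np

# Two-state CTMC (UP <-> DOWN) for one repairable component.
# Failure rate = 1/MTTF, repair rate = 1/MTTR (hypothetical values, in hours).
mttf, mttr = 1000.0, 8.0
Q = np.array([[-1.0 / mttf,  1.0 / mttf],   # transitions leaving UP
              [ 1.0 / mttr, -1.0 / mttr]])  # transitions leaving DOWN

# Steady state: solve pi Q = 0 together with the normalization sum(pi) = 1
A = np.vstack([Q.T, np.ones(2)])
b = np.array([0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

availability = pi[0]  # equals MTTF / (MTTF + MTTR) ~= 0.99206
```

For this two-state chain the closed form MTTF/(MTTF + MTTR) suffices, but the same linear-system approach scales to the larger CTMCs generated from SPN models.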


Figure 1 shows an example with two places (SYS_UP and SYS_DOWN), which may contain small black circles (tokens). White rectangles (ET1 and ET2) represent exponential transitions that, after a delay that follows an exponential distribution, take a number of tokens from the input place and put these tokens into the output place. Places and transitions are connected by arcs, in which the arc weight defines the number of tokens collected from the input and produced at the output. In this example, the arc weight equals one (when the number of tokens consumed is 1, it is omitted in the graphical representation), i.e., when ET2 fires, it obtains one token from SYS_UP and produces one token at SYS_DOWN. Black rectangles (IT1 and IT2) represent immediate transitions, which fire instantly if their requirements are satisfied.

Figure 1 – Petri net example

Furthermore, Petri Nets support guard functions. A guard function (gf) enables a transition when the function returns true. In this example, the system will be unavailable (there will be one token at SYS_DOWN) when the network (assuming another model that represents it) is down, and it becomes operational (up) when the repair of the network is completed.

Petri Nets also define two types of firing semantics: single server and infinite server. A transition following the single server semantics always fires once at a time, if enabled. Conversely, transitions following the infinite server semantics treat each token in the enabling place individually and can fire q times if enabled, where q is the number of tokens satisfying the condition to fire (BAUSE; KRITZINGER, 2002). In Figure 1, if the infinite server semantics is assumed, the model comprises two active instances of a system in parallel, where the delay time counts independently for both instances at the same time. On the other hand, the single server semantics represents an active-standby system, i.e., the delay time counts only for the active instance. When ET2 fires, the delay time starts to count for the other instance, which becomes active.

This work uses SPNs for DC modeling at a high abstraction level, generating the corresponding CTMCs and solving them analytically to obtain a steady-state solution. One of the objectives comprises the use of strategies to mitigate the state explosion problem (VALMARI, 1998). This problem occurs when an SPN model generates a high number of states in its respective CTMC, which may prevent solving the CTMC in case of a lack of computational resources. Therefore, we are also concerned with providing an efficient solution, one that diminishes the state explosion problem when modeling DC availability.

2.2 DC COMPONENTS

Gartner defines a data center as a department that hosts a large number of IT components and stores a lot of data in a centralized way¹. The Association for Computer Operations Management (AFCOM) standardizes DCs by two metrics: size and density. The size of a DC varies from mini (1-10 racks in 1-25 m²) to mega (more than 9,000 racks and 22,500 m²). The density varies from low (less than 4 kW per rack) to extreme (more than 16 kW per rack)².

The DC is composed of three subsystems: power, cooling, and IT. The power subsystem provides electricity to both the cooling and IT subsystems and receives much attention both in the literature and in industry. In general, the power subsystem has a lot of redundancy, with many batteries to keep the generator working in case of a failure of the utility, the main provider of electricity to the DC. For more details, see (ROSENDO et al., 2018) (ROSENDO et al., 2017).

The cooling and IT subsystems have different architectures and components depending on the applied strategy (EVANS, 2012). Regarding the DC cooling subsystem, this work considers a chilled water system, as do other works in the literature (CALLOU et al., 2012)(CALLOU et al., 2013). This type of cooling subsystem is composed of a CRAC (Computer Room Air Conditioner), a chiller, and a cooling tower, as shown in Figure 2. The system cools the DC through water at a low temperature.

Figure 2 – Components of a cooling subsystem in a chilled water architecture.

1 https://www.gartner.com/it-glossary/data-center/

2 https://www.datacenterknowledge.com/archives/2014/10/15/how-is-a-mega-data-center-different-from-a-massive-one


The CRAC cools the DC room using chilled water and rejects the heat from the IT infrastructure. The heated water from the CRAC goes to the chiller, which, in turn, cools the water and sends it back to the CRAC, maintaining the chilled water cycle. Another cycle, the condenser water cycle, occurs between the chiller and the cooling tower. Since the chiller also uses water to cool the water sent to the CRAC, the water used by the chiller warms during this cooling process. To avoid the chiller's overheating, this water is sent to the cooling tower, which dissipates the heat to the external environment, cools the water, and returns it to the chiller. Other components, such as pipes and pumps, are not modeled due to their small impact on the system availability (GOMES et al., 2017).

On the other hand, the IT subsystem is composed of computing, storage, and network components. The Network Attached Storage (NAS) comprises the storage component. Edge, core, and aggregation routers represent the network components. Finally, racks grouping a set of servers comprise the computing component, with a Top-of-Rack (TOR) switch to allow communication between racks and routers. Figure 3 illustrates the DC components.

Figure 3 – Components of an IT DC subsystem (based on (SANTOS et al., 2017))

Servers host applications, and these servers are composed of CPUs, Network Interface Cards (NICs), and RAM. Racks group dozens of servers, storage, and networking equipment, which characterizes the main component for DC planning concerning cooling and power requirements. The storage system, composed of disk drives or flash devices, persistently stores the applications' data, which servers access remotely to provide high availability. Finally, networking devices connect servers and storage systems and manage all data flow from/to the DC. The link between users and the DC is the edge router, which forwards user requests to the core router level, which in turn sends the information to the aggregation router. TOR switches receive the requests from the aggregation router and forward them to the servers in the rack that, finally, process the requests.

2.3 SENSITIVITY ANALYSIS

2.3 SENSITIVITY ANALYSIS

In order to identify the most critical components of a system, sensitivity analysis is a widely adopted technique (OPALSKI, 2015), (LIU et al., 2018), (MATOS et al., 2017), (LEURENT; PIVANO, 2019). The analysis varies one parameter of the model at a time, while the others are fixed, to measure its impact on the metric of interest; for example, the time to repair a component can be varied to observe how it impacts availability. Finally, a sensitivity index is generated, showing the parameters with the highest impact on the metric of interest.

The analysis can be performed using different techniques, such as differential sensitivity analysis, factorial design, importance factors, and percentage difference, among others (HAMBY, 1994). Although differential analysis is widely used in SPNs and Markov Chains, it may have problems either in a non-continuous domain or when the model cannot be represented by equations (ANDRADE et al., 2017). To overcome this limitation, the percentage difference may be applied in these scenarios. The percentage difference calculates the difference between minimum and maximum variations (HAMBY, 1994), as shown in Equation 2.1, where $SI_p$ is the sensitivity index of an analyzed parameter $p$, and $D^{max}_p$ and $D^{min}_p$ represent the maximum and minimum output metric values, respectively, achieved from the variation of the parameter value.

$$SI_p = \frac{D^{max}_p - D^{min}_p}{D^{max}_p} \qquad (2.1)$$
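As a concrete illustration of Eq. 2.1 (a minimal sketch added here; the availability values are hypothetical), the index can be computed directly from the metric values collected while one parameter is varied:

```python
def sensitivity_index(outputs):
    """Percentage-difference sensitivity index of Eq. 2.1.

    outputs: metric values (e.g., availabilities) obtained by varying a
    single parameter over its range while keeping the others fixed."""
    d_max, d_min = max(outputs), min(outputs)
    return (d_max - d_min) / d_max

# Hypothetical availabilities obtained by varying, e.g., a component's MTTR
print(sensitivity_index([0.9991, 0.9994, 0.9996, 0.9998]))  # ~0.0007
```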

2.4 SURROGATE MODELS

As described previously, to solve an optimization problem through numerical algorithms or other iterative meta-heuristics, a given optimization algorithm runs several times in order to find an optimal solution. These algorithms rely on a fitness function to compare their solutions, and in this work the fitness function is calculated through a stochastic model. However, a more detailed and representative model is, ordinarily, also time-consuming.

To overcome limitations caused by a model's complexity, one should derive a simpler model from the original one. This surrogate model, also known as a "model of a model", exhibits a behavior very similar to the original one (in this case, the stochastic model) but requires a much smaller solving time (WANG et al., 2014a). The creation of a surrogate model depends on some inputs and outputs gathered from the original one. Afterwards, the surrogate model can estimate new outputs based on new inputs. For example, solving the stochastic model 100 times with different inputs gives 100 different outputs, and the surrogate model will then estimate an output for an untried input based on the previous 100 input and output values.

This section discusses three strategies from the literature (YANG et al., 2016), (NAWAR; MOUAZEN, 2017), (JUNIOR, 2018) to build surrogate models: Gaussian Process (GP), Random Forest (RF), and Gradient Boosting Machine (GBM).

2.4.1 Gaussian Process

Based on past studies (WANG et al., 2014a) (SACKS et al., 1989) (BOOKER, 1998) (LIFSHITS, 2012), the GP stands out as one of the most robust and efficient methods to develop accurate surrogate models.

Kriging (KRIGE, 1951) was one of the first methods using Gaussian processes. In the context of geostatistics, for example, kriging is used for spatial interpolation in scenarios to predict minerals (such as gold), soil nitrogen, biomass, etc., or to predict meteorological variables (temperature, pressure, etc.) at unsampled locations. In the late 1980s, due to the increasing demand for computer-based experimentation, a kriging-based method was developed to build surrogate models from simulators, known as Design and Analysis of Computer Experiments (DACE) (SACKS et al., 1989).

Let $x = (x_1, x_2, ..., x_n)$ be the input values, and $y = (y(x_1), y(x_2), ..., y(x_n))$ the output values. A Gaussian process is specified, generally, by a mean function $\mu(x) = E[y(x)]$ and a covariance function, also called kernel function, $k(x, x')$, which correlates two inputs. $K$ comprises the correlation matrix of the tried $x$ inputs, with entries $k(x_i, x_j)_{1 \leq i,j \leq n}$ (SEEGER, 2004). The vector $k(x_*) = (k(x_*, x_1), k(x_*, x_2), ..., k(x_*, x_n))$ comprises the covariances of an untried input with all other inputs. Assuming that an untried $y(x_*)$ can be predicted from a probability distribution $P(y(x_*) | y(x), x_*)$ with mean $\mu(x) = 0$ and variance $\sigma^2 I$, it is possible to predict (WILLIAMS; BARBER, 1998):

$$y(x_*) = k^T(x_*)(K + \sigma^2 I)^{-1} y,$$
$$\sigma^2_{y(x_*)} = k(x_*, x_*) - k^T(x_*)(K + \sigma^2 I)^{-1} k(x_*).$$

Regarding kernel functions, there is a set of functions widely used in the literature, such as the exponential, polynomial, and Radial Basis Function (RBF) kernels. In this work, we use the polynomial kernel due to its good results with a reasonable number of features, which mitigates the probability of overfitting. The polynomial kernel function comprises:

$$k(x, x') = (1 + x \cdot x')^d \qquad (2.2)$$

where $d$ is the polynomial degree (in this work, 2).
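The prediction equations above translate almost directly into code. The sketch below is a minimal numpy implementation assumed for illustration (not the thesis' actual tooling): it fits a zero-mean GP with the polynomial kernel of Eq. 2.2 and returns the predictive mean and variance for untried inputs.

```python
import numpy as np

def poly_kernel(A, B, d=2):
    # Polynomial kernel of Eq. 2.2: k(x, x') = (1 + x . x')^d
    return (1.0 + A @ B.T) ** d

def gp_predict(X, y, X_new, noise=1e-4, d=2):
    """Predictive mean and variance of a zero-mean GP regression model."""
    K = poly_kernel(X, X, d)
    L = np.linalg.cholesky(K + noise * np.eye(len(X)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # (K + sigma^2 I)^-1 y
    K_star = poly_kernel(X_new, X, d)
    mean = K_star @ alpha                                # k^T(x*)(K + sigma^2 I)^-1 y
    v = np.linalg.solve(L, K_star.T)
    var = np.diag(poly_kernel(X_new, X_new, d)) - np.sum(v ** 2, axis=0)
    return mean, np.maximum(var, 0.0)  # clip tiny negative values from roundoff

# Toy usage mirroring Figure 4: six samples of a cosine function
X = np.array([[1.0], [3.0], [5.0], [6.0], [7.0], [8.0]])
y = np.cos(X).ravel()
mean, var = gp_predict(X, y, np.array([[4.0], [6.5]]))
```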

Figure 4 shows a graphical example with a GP model representing a cosine function. Six sampled outputs calculated by the original model from the inputs [1, 3, 5, 6, 7, 8] were gathered. A prediction is considered more accurate when it is closer to a sampled output value, reflecting a high correlation with the original output value. On the other hand, a predicted output far distant from the sampled outputs leads to low correlations and, consequently, inaccurate results.

Figure 4 – A GP regression model example

2.4.2 Random Forest

RF is widely adopted both in regression and classification problems in machine learning. According to (BREIMAN, 2001), a Random Forest consists of a collection of trees that, depending on the output variable, may act as a classifier or a regression model, in which each tree is independent and outputs a result. Figure 5 shows a tree example, the basic unit of a random forest. Figure 5(a) shows the prediction function y(x) that generates the regression tree in Figure 5(b), where an x value, known as the split value, divides the tree into two sides. In this example, the first split occurs at x = 6, which puts the y(x) results obtained when x ≤ 6 in the left split and the other results in the right split. In the next nodes, the split occurs again with s = 3 and s = 8 at the left and right nodes, respectively. Finally, the regression tree reaches the terminal nodes, i.e., the nodes that contain the results.

In classification, the most-voted class, i.e., the one in which the majority of the trees classified the input, is predicted. In regression, the average value over all trees comprises the predicted result. In this work, RF is applied in a regression task, as shown in Figure 6.

To produce an RF, a bootstrap sample is required for each tree. Bootstrapping consists in selecting random entries with repetition, creating a new entry set of the same size as the original set (FRIEDMAN; HASTIE; TIBSHIRANI, 2001). Suppose a model with entries E = {(x_1, y(x_1)), (x_2, y(x_2)), ..., (x_N, y(x_N))}. Bootstrapping generates a new set of size N based on the original entries, selected with replacement. Supposing N = 3 and an entry set E_ex = {(1, 2), (2, 3), (3, 4)}, a possible new set after bootstrapping would be, for instance, {(1, 2), (1, 2), (3, 4)}.



Figure 5 – A regression tree example. In (a), the prediction function y(x); In (b), the regression tree based on splits.

Figure 6 – An RF model example (based on <https://dsc-spidal.github.io/harp/docs/examples/rf/>)

According to (FRIEDMAN; HASTIE; TIBSHIRANI, 2001), we need to define the number of trees B. Then, for each tree, a bootstrapping set is generated, and the tree is built recursively until the minimum depth is reached, following the steps:

• Among the n variables of input x, select m variables;

• Select the best variable/split among the m variables;

• Create two daughter nodes.

These steps generate an RF in a process similar to the one illustrated in Figure 6, in order to get the most varied results. After creating all trees, represented by the ensemble T, the regression prediction for an untried input x* is the average of all tree predictions, defined as:

$$y(x_*) = \frac{1}{B} \sum_{b=1}^{B} T_b(x_*) \qquad (2.3)$$


Here, y(x*) indicates the result for an untried input x*, B is the number of trees, and T_b is the tree at index b, which generates an output based on the untried input x*.
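To make the bagging-plus-averaging recipe of Eq. 2.3 concrete, the sketch below (assuming numpy and scikit-learn's DecisionTreeRegressor; an illustration added here, not the thesis' implementation) builds B trees on bootstrap samples and averages their predictions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def rf_fit(X, y, B=100, seed=None):
    """Grow B regression trees, each on a bootstrap sample of the training set."""
    rng = np.random.default_rng(seed)
    trees = []
    n = len(X)
    for _ in range(B):
        idx = rng.integers(0, n, size=n)  # sampling with replacement
        tree = DecisionTreeRegressor(max_features="sqrt")  # m of the n variables per split
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def rf_predict(trees, X_new):
    # Eq. 2.3: the prediction is the average over all B trees
    return np.mean([t.predict(X_new) for t in trees], axis=0)
```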

2.4.3 Gradient Boost Machine

The main goal of supervised machine learning lies in reducing the model error with respect to a training set through known parameters. GBM, also known simply as Gradient Boosting, aims to leverage the model errors, fitting them in another model in order to improve accuracy and reduce variance. GBM handles regression and classification tasks and may be coupled as a strategy to reduce error in other models (FRIEDMAN; HASTIE; TIBSHIRANI, 2001).

Let $f(x)$ be a GBM regression tree model, with loss function $L = \frac{1}{2}\sum_{i=1}^{N}(y_i - f(x_i))^2$, where $N$ comprises the number of inputs and $x_i$ the $i$-th input. We can define a number of iterations $M$ in order to minimize the residuals. In the first iteration, $f_0(x)$, the method may generate a highly biased model, which gives many errors. The next step consists in calculating the residuals for each output in the $m$-th iteration based on Eq. 2.4 (FRIEDMAN; HASTIE; TIBSHIRANI, 2001):

$$r_{im} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f = f_{m-1}} \qquad (2.4)$$

Eq. 2.4, based on the example loss function, returns $r_{im} = y_i - f(x_i)$. After getting the residuals, GBM fits a regression tree model $h_m$ to the residuals of $f_{m-1}$, i.e., a model based on the pairs $(x_i, r_{im})$. Finally, the function $f_m$ is updated, as shown in Eq. 2.5:

$$f_m(x) = f_{m-1}(x) + h_m(x) \qquad (2.5)$$

An intuitive GBM application is illustrated in Table 2 and Figure 7. Let f(x) be a regression tree that outputs the split mean if the split contains x_i. Let y(x) be the output values of the training set. The first step consists in generating f_0(x), as presented in Table 2. The regression tree, hence, divides the output values into two splits according to whether x is greater than or equal to six. Therefore, f_0(x) predicts the split mean when an input x satisfies the condition. For example, if x = 6, then f_0(x) = 5.3, since x ≥ 6 and (5.19 + 5.34 + 5.40)/3 ≈ 5.3. On the other hand, if x = 3, then f_0(x) = 15.56. Figure 7(a) presents the graphical result.

The next step starts iteration 1, comprising the calculation of the residuals (r_im; in this case, as x is a single variable, r_m), which subtracts f_0(x) from y(x). Then, h_m(x) fits the structure of f_{m-1}, replacing the y(x) output values with r_m. Therefore, h_1(x) predicts -4.80 for x values below 3 and 2.40 otherwise. Notice that -4.80 and 2.40 are the means of the r_1(x) values that satisfy the split conditions. Iteration 1 finishes when f_1(x) receives the sum of f_0(x) and h_1(x). The graphical result of f_1(x) is illustrated in Figure 7(b), based on the r_1(x) values.


Table 2 – A simple GBM regression with values referring to the three first iterations

x | y(x)  | f0(x) | r1(x) | h1(x) | f1(x) | r2(x) | h2(x) | f2(x) | r3(x) | h3(x) | f3(x)
0 | 10.83 | 15.56 | -4.72 | -4.80 | 10.76 |  0.07 |  1.20 | 11.96 | -1.12 | -1.20 | 10.76
1 | 11.44 | 15.56 | -4.11 | -4.80 | 10.76 |  0.68 |  1.20 | 11.96 | -0.51 | -1.20 | 10.76
2 | 10.00 | 15.56 | -5.56 | -4.80 | 10.76 | -0.76 |  1.20 | 11.96 | -1.96 | -1.20 | 10.76
3 | 20.60 | 15.56 |  5.04 |  2.40 | 17.96 |  2.64 |  1.20 | 19.16 |  1.44 |  0.60 | 19.76
4 | 20.29 | 15.56 |  4.73 |  2.40 | 17.96 |  2.33 |  1.20 | 19.16 |  1.13 |  0.60 | 19.76
5 | 20.18 | 15.56 |  4.62 |  2.40 | 17.96 |  2.22 |  1.20 | 19.16 |  1.02 |  0.60 | 19.76
6 |  5.19 |  5.30 | -0.11 |  2.40 |  7.71 | -2.52 | -2.40 |  5.31 | -0.12 |  0.60 |  5.91
7 |  5.34 |  5.30 |  0.04 |  2.40 |  7.71 | -2.36 | -2.40 |  5.31 | 0.036 |  0.60 |  5.91
8 |  5.40 |  5.30 |  0.1  |  2.40 |  7.71 | -2.31 | -2.40 |  5.31 | 0.087 |  0.60 |  5.91

Table 2 also shows the procedures related to iterations 2 and 3, shown graphically in Figures 7(c) and 7(d) respectively. GBM then converges after a reasonable number of iterations, as shown in Figures 7(e) and 7(f) with 9 and 18 iterations, respectively.

Although this GBM example has one tree, the method generally has hundreds of trees. The major difference from RF consists in boosting instead of bagging, which generates trees based on the residuals of previous trees. GBM may also have a shrinkage factor v, which decreases the weight of each error-fitting step. In general, v = 0.1, and it updates Eq. 2.5 by multiplying h_m(x), as shown in Eq. 2.6 (reconstructed here from this description, as the original equation was lost in extraction):

$$f_m(x) = f_{m-1}(x) + v \cdot h_m(x) \qquad (2.6)$$
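The iteration of Table 2 can be written compactly. The sketch below (again assuming scikit-learn trees; for simplicity it starts from a constant model f_0 rather than the one-split tree of Table 2, and the parameter values are illustrative) repeatedly fits a small tree h_m to the current residuals and applies the shrinkage update of Eq. 2.6:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbm_fit(X, y, M=100, v=0.1, max_depth=1):
    """Least-squares gradient boosting: each tree fits the current residuals."""
    f0 = float(np.mean(y))                    # initial model f_0(x): a constant
    residual = np.asarray(y, dtype=float) - f0
    trees = []
    for _ in range(M):
        h = DecisionTreeRegressor(max_depth=max_depth)
        h.fit(X, residual)                    # h_m fits r_im = y_i - f_{m-1}(x_i)
        residual -= v * h.predict(X)          # Eq. 2.6: f_m = f_{m-1} + v * h_m
        trees.append(h)
    return f0, trees

def gbm_predict(f0, trees, X_new, v=0.1):
    return f0 + v * sum(t.predict(X_new) for t in trees)
```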



Figure 7 – A GBM regression example. In (a), the prediction function y(x); in (b), the residuals (errors) of each prediction. Images based on <https://www.kaggle.com/grroverpr/gradient-boosting-simplified/notebook>


2.5 LATIN HYPERCUBE SAMPLING

The surrogate model accuracy depends on the sampling technique used to gather data from the original model. A clustered sampling would result in an unfitted surrogate model due to high variation in uncovered areas, so a sparse sampling strategy increases the chance of covering the whole original model area. A widely adopted strategy to sample data for surrogate modeling is LHS (CHU et al., 2015), which divides the range of each variable into disjoint intervals of equal probability, choosing one value from each interval (FLORIAN, 1992).

Consider S a matrix composed of N samples and K variables. S contains values between 0 and 1 and represents the basic sampling plan. As an example, let N = 5 and K = 2. A possible matrix S, where each column comprises the sampling values of one variable, would be:

$$S = \begin{bmatrix} 0.31 & 0.25 \\ 0.15 & 0.64 \\ 0.48 & 0.97 \\ 0.79 & 0.04 \\ 0.89 & 0.41 \end{bmatrix} \qquad (2.7)$$

This matrix S generates N points in a K-dimensional plan, as represented in Figure 8, where each axis comprises a model variable. The Latin algorithm divides the plan into an N × N grid, where each point occupies a particular region. In this example, as K = 2, the plan consists in a Latin Square grid containing at least one point in each row and in each column, indicating that, even though the sampling is sparse, it covers all rows and columns in the plan. The Latin Hypercube extends the Latin Square concept to an arbitrary number of dimensions. For more details about the LHS technique, see (FLORIAN, 1992) (OLSSON; SANDBERG; DAHLBLOM, 2003).
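A basic LHS plan like the matrix S above can be generated in a few lines. The sketch below (a minimal numpy illustration of basic LHS, not the optimum variant discussed next) draws one point from each of the N equal-probability strata of every variable and permutes the strata independently per column:

```python
import numpy as np

def latin_hypercube(N, K, seed=None):
    """Basic LHS: an N x K matrix of samples in [0, 1), one per stratum."""
    rng = np.random.default_rng(seed)
    # One uniform point inside each of the N strata [i/N, (i+1)/N)
    points = (np.arange(N)[:, None] + rng.random((N, K))) / N
    # Permute the strata independently for each variable (column)
    for k in range(K):
        points[:, k] = rng.permutation(points[:, k])
    return points

S = latin_hypercube(N=5, K=2)  # a plan like the matrix S in Eq. 2.7
```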

An improvement on the LHS technique, called optimum LHS (OLHS) (STOCKI, 2005), achieves better results in comparison to the basic LHS technique. OLHS aims to minimize a distance criterion between the points using the Columnwise-Pairwise (CP) algorithm. CP interchanges two samples in a column and recalculates the distance between points through the inverse of the sum of the distances between them. For the swap with the best minimization result, CP keeps the swapped pair in the same column positions and goes to the next column. The loop persists until CP reaches the maximum number of sweeps (how many times CP acts on each column) or satisfies the optimal stopping criterion (a threshold that the inverse of the sum of the distances between points needs to reach).

After finishing LHS, each variable fits the sampling points (between 0 and 1) into its probability distribution to map the LHS sampling into variable range values (OLSSON; SANDBERG; DAHLBLOM, 2003). As can be seen, LHS allows a considerable area coverage and does not select close observations. The combination of LHS for sampling and surrogate models is a reasonable strategy to establish accurate models with less complexity.

Figure 8 – Latin Square plan example. LHS extends the concept for an arbitrary number of dimensions.

2.6 OPTIMIZATION BASED ON EVOLUTIONARY ALGORITHMS

Optimization is the process of finding the best solution(s) to a problem, taking into account possible constraints (WANG et al., 2014b). Optimization can be applied in several real-world scenarios, ranging from operations research and physical applications to product design, and so on (WU; MALLIPEDDI; SUGANTHAN, 2016). However, finding an optimal solution to a large and complex problem is often a computationally hard task. As a result, approaches based on Evolutionary Algorithms (EA) (STREICHERT, 2002) combine meta-heuristic algorithms with some problem-specific knowledge. These algorithms are based on evaluating a range of several solutions (the population) using a function (commonly called the fitness function) to choose the optimal solution, and producing new candidate solutions (the next generation) through a specific strategy provided by the meta-heuristic. Figure 9 shows the general process of evolutionary algorithms.

The next sections discuss three EAs: Genetic Algorithm, Non-dominated Sorting Genetic Algorithm, and Differential Evolution.

2.6.1 Genetic Algorithm

GA aims to find the best solution through concepts based on Darwin's theory of evolution (GOLDBERG, 1989): natural selection. Let D be the number of parameters in a model. GA generates N random vectors of length D, comprising the initial population of solutions. Next, GA selects the best solutions found so far in order to replicate them to the next generation and discards the worst ones, which characterizes the reproduction process.

Figure 9 – General Evolutionary Algorithms process. Based on (STREICHERT, 2002)

Table 3 shows an example of the reproduction process. GA generates a random population with N = 4. Each population vector has D = 5 parameters represented by binary digits. Each vector X_i, with i = 1, 2, 3, 4, comprises an unsigned integer number in binary notation, as shown in the next column. Then, GA applies a function to evaluate this entry, called the fitness function; in this example, f(x) = x² − x, which generates the values in the corresponding column. The next column depicts the probability of maintaining the X_i vector based on the fitness results: each fitness result is divided by the sum of all fitness outputs. Afterward, GA calculates how many copies of each vector will be replicated in the next reproduction, dividing p_{X_i} by the average of the probabilities (in this case, 0.25). Rounding the expected count to an integer, GA replicates one copy each of X_1 and X_3, and two copies of X_4, which replaces the vector X_2, since it does not reach a good fitness result.

After replication, GA randomly chooses pairs for the crossover step. GA generates a random number between 0 and 1 and compares it with the crossover probability (CP). Figure 10 illustrates the process. GA generates a random integer between 1 and D to define a split index. The crossover interchanges the X_1 and X_3 indexes from this split until the last index. Therefore, if split = 4, crossover exchanges the fourth and fifth indexes of the X_3 vector with the corresponding indexes of X_1.

Table 3 – A simple reproduction step with GA. Based on (GOLDBERG, 1989)

i       | Initial Random Population | Integer value (x) | Fitness function (x² − x) | p keep x | Expected count | Population after replication
1       | 10010                     | 18                | 306                       | 0.32     | 1.28 (~1)      | 10010
2       | 00101                     | 5                 | 20                        | 0.02     | 0.0008 (~0)    | 10110
3       | 01101                     | 13                | 156                       | 0.17     | 0.68 (~1)      | 01101
4       | 10110                     | 22                | 462                       | 0.49     | 1.96 (~2)      | 10110
Sum     |                           |                   | 944                       | 1.00     | 4              |
Average |                           |                   | 236                       | 0.25     | 1              |

Figure 10 – GA crossover for 𝐷 = 5 parameters and pair (𝑋1,𝑋3).

Finally, the last step may change a parameter after the crossover step. This step, called mutation, changes a parameter value if a random number is smaller than the mutation probability (MP). This way, for each parameter index, GA generates a random value and performs the mutation if necessary. In a binary set, the mutation flips the bit (0 to 1 and vice versa). In a real domain, the new value comes from a random uniform distribution between given lower and upper limits. After mutation, the generation finishes and a new population grows in order to repeat the steps for the next generation.

Moreover, GA applies a technique to converge faster, called elitism. Elitism carries a percentage of the population with the best results over to the next generation, in order to maintain good estimates when a new population grows. Therefore, elitism ensures that good candidates will continue and have a higher probability of generating still better candidates. GA, in general, requires a high CP (between 0.8 and 1), a low MP (between 0.001 and 0.2), and an elitism rate of 5%.
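The reproduction, crossover, mutation, and elitism steps above can be combined into a compact loop. The sketch below is an illustrative binary GA in numpy (not the thesis' implementation; it assumes a non-negative fitness function, and the parameter values are hypothetical), following the Table 3 example of maximizing f(x) = x² − x over 5-bit integers:

```python
import numpy as np

def ga_maximize(fitness, D=5, N=20, generations=50, CP=0.9, MP=0.05, elite=0.05,
                seed=None):
    """Binary GA: roulette-wheel reproduction, one-point crossover, bit-flip mutation."""
    rng = np.random.default_rng(seed)
    pop = rng.integers(0, 2, size=(N, D))
    n_elite = max(1, int(elite * N))
    for _ in range(generations):
        fit = np.array([fitness(ind) for ind in pop], dtype=float)
        elite_inds = pop[np.argsort(fit)[::-1][:n_elite]].copy()   # elitism
        total = fit.sum()                                          # assumes fitness >= 0
        probs = fit / total if total > 0 else np.full(N, 1.0 / N)  # "p keep x" column
        parents = pop[rng.choice(N, size=N, p=probs)]              # reproduction
        children = parents.copy()
        for i in range(0, N - 1, 2):                               # one-point crossover
            if rng.random() < CP:
                split = rng.integers(1, D)
                children[i, split:] = parents[i + 1, split:]
                children[i + 1, split:] = parents[i, split:]
        children[rng.random(children.shape) < MP] ^= 1             # bit-flip mutation
        pop = np.vstack([elite_inds, children[: N - n_elite]])
    fit = np.array([fitness(ind) for ind in pop], dtype=float)
    return pop[fit.argmax()], fit.max()

# Table 3 example: maximize f(x) = x^2 - x, where x is a 5-bit integer
to_int = lambda bits: int("".join(map(str, bits)), 2)
best, value = ga_maximize(lambda b: to_int(b) ** 2 - to_int(b))
```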

Non-dominated Sorting Genetic Algorithm II (NSGA-II), a multi-objective genetic algorithm (MOGA), aims to improve the elitism technique (DEB et al., 2002) discussed in section 2.6.1, i.e., in a multi-objective design, NSGA-II tries to maintain even more good solutions for the next generations than GA.

Let P be a population of size N. NSGA-II generates a new population P' in a similar way to GA, through the reproduction, crossover, and mutation steps. NSGA-II then joins the populations P and P' to identify the best solutions and keeps them for the next generations.

NSGA divides the best solutions into groups called fronts. Each front gets the least dominated solutions, i.e., the solutions with the best results that dominate the other ones. Table 4 illustrates an example of the fast-non-dominated sort algorithm with a population of five individuals. The algorithm compares each fitness solution with the other ones and generates a domination counter, i.e., if a solution has a fitness higher than another, its domination counter increases by 1 (DEB et al., 2002). At the end of these pairwise comparisons, a non-dominated solution is identified if its domination counter equals zero, and it belongs to the first front (F1). The algorithm puts the dominated solutions into subsequent fronts F2, F3, and so on. The sorting ensures that the first fronts contain the least dominated solutions.

Table 4 – An example of the fast-non-dominated sort algorithm with N = 5.

x | Fitness | Domination Counter | F_i
1 | 5       | 0                  | F1
2 | 5       | 0                  | F1
5 | 20      | 4                  | F3
7 | 10      | 2                  | F2
8 | 10      | 2                  | F2

In this example, the solutions with x = 1, 2 achieved the lowest fitness values for the objective, obtaining a domination counter equal to zero and belonging to front F1. The solutions x = 7 and x = 8 have a domination counter of two each (both have a higher fitness than the solutions for x = 1 and x = 2). The fast-non-dominated sort algorithm thus puts these solutions into F2 and, in a similar way, the solution x = 5 into F3. The name fast comes from the complexity O(MN²), where M comprises the number of objectives.
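The domination counting of Table 4 is easy to reproduce. The sketch below is a simplified single-objective illustration for minimization, added here for clarity (full NSGA-II compares solutions objective-by-objective): it groups solutions into fronts by their domination counters.

```python
def fast_non_dominated_fronts(fitness):
    """Group solutions into fronts by domination counter (minimization).

    fitness: list of fitness values, one per solution."""
    # Count, for each solution, how many solutions dominate it (lower fitness)
    counters = [sum(1 for g in fitness if g < f) for f in fitness]
    fronts = {}
    for idx, c in enumerate(counters):
        # Solutions sharing a domination counter fall into the same front
        fronts.setdefault(c, []).append(idx)
    return [fronts[c] for c in sorted(fronts)]

# Table 4 example: fitness of the five solutions x = 1, 2, 5, 7, 8
print(fast_non_dominated_fronts([5, 5, 20, 10, 10]))  # [[0, 1], [3, 4], [2]]
```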

Figure 11 shows the NSGA-II procedure. As discussed previously, NSGA-II puts the P and P' sets together and performs the sorting of the solutions into fronts F1, F2, F3, and F4. Let N = 8; NSGA-II keeps the first fronts in the next generation until reaching the population size. In this example, front F1 - the most dominant, in red - passes to the next generation, as well as F2, in yellow (the front dominated only by F1). On the other hand, a solution belonging to F3 does not stay in the next generation, since its addition would surpass the maximum population size. NSGA-II discards the other solutions. Therefore, NSGA-II tries to maintain the best solutions of previous generations for multi-objective optimizations, which gives it the "fast and elitist" algorithm name (DEB et al., 2002). For more details, see (DEB et al., 2002) and (KONAK; COIT; SMITH, 2006).

Figure 11 – NSGA procedure example for a population with size 𝑁 = 8. Based on (DEB et al., 2002)

2.6.3 Differential Evolution

Differential Evolution (DE) applies an evolutionary strategy based on the mutation and crossover of a population (STORN; PRICE, 1997). DE is a simpler but powerful minimization algorithm compared to GA, due to its high randomness, inexpensive arithmetic operators, and good solution convergence (TANG; ZHAO; LIU, 2014).

Let D be the number of parameters in a model, and N the population size. DE receives the minimum and maximum values of each parameter d ∈ D and randomly generates N vectors of size D. DE then saves the vector with the best solution, called the target (A_{i,G}), with index i = 1, 2, 3, ..., N and generation G, in order to keep some of its parameters in the crossover step. Next, DE performs mutation: it randomly selects three vectors with indexes r1, r2, and r3 to generate a mutant vector M_{i,G+1} for the next generation. Eq. 2.8 describes the mutation (STORN; PRICE, 1997):

$$M_{i,G+1} = X_{r_1,G} + F \times (X_{r_2,G} - X_{r_3,G}) \qquad (2.8)$$

The constant F, generally with a value between 0 and 2, controls the impact of the X_{r2,G} and X_{r3,G} vectors. The next step comprises the crossover. DE creates a trial vector C_{i,G+1} based on the mutant and target vectors, as shown in Eq. 2.9. DE maintains at least one parameter of the mutant vector, chosen randomly (j = ridx(C_{i,G+1})), so that the trial vector indeed differs from the previous target vector. For all D parameters, DE verifies two conditions: if the index j equals ridx(C_{i,G+1}), it keeps the mutant value at index j in the trial vector; otherwise, it generates a random number between 0 and 1 and compares it with a crossover constant CR, which also has a value between 0 and 1. If the random value is smaller than CR, DE chooses the value of the mutant vector; otherwise, C_{ji,G+1} keeps the value of the target vector at index j.

C_{ji,G+1} = \begin{cases} M_{ji,G+1}, & \text{if } rand(j) \leq CR \text{ or } j = ridx(i) \\ A_{ji,G}, & \text{if } rand(j) > CR \text{ and } j \neq ridx(i) \end{cases} \quad j = 1, 2, \ldots, D    (2.9)

Figure 12 shows the crossover process with 𝐷 = 5 parameters. The trial vector receives at least one parameter from the mutant vector, and it takes a parameter from the target whenever the random number generated at index 𝑗 is higher than 𝐶𝑅. After generating the trial vector, DE compares it with the target vector and, if the trial vector yields a smaller value of the objective function, it becomes the target vector in the next generation. The whole process repeats until the maximum number of generations 𝐺 is reached.

Figure 12 – DE crossover for 𝐷 = 5 parameters. DE selects one random parameter from the mutant vector (in this example, the second) in order to make the trial and target vectors differ. Based on (STORN; PRICE, 1997).
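For concreteness, the sketch below assembles the mutation (Eq. 2.8), crossover (Eq. 2.9), and selection steps into a minimal DE loop in Python with NumPy. The objective function, bounds, and parameter values in the usage line are illustrative assumptions, not values used in this work.

import numpy as np

def differential_evolution(f, bounds, N=20, F=0.8, CR=0.9, G=100, seed=1):
    """Minimal DE/rand/1/bin sketch following Eqs. 2.8 and 2.9.

    f      -- objective function to minimize, f(x) -> float
    bounds -- list of (min, max) pairs, one per parameter (len = D)
    """
    rng = np.random.default_rng(seed)
    D = len(bounds)
    lo, hi = np.array(bounds, dtype=float).T
    pop = rng.uniform(lo, hi, size=(N, D))   # N random vectors of size D
    fit = np.array([f(x) for x in pop])
    for _ in range(G):
        for i in range(N):
            # Mutation (Eq. 2.8): three distinct random vectors r1, r2, r3 != i
            r1, r2, r3 = rng.choice([j for j in range(N) if j != i],
                                    size=3, replace=False)
            mutant = np.clip(pop[r1] + F * (pop[r2] - pop[r3]), lo, hi)
            # Crossover (Eq. 2.9): keep at least the mutant parameter at ridx
            ridx = rng.integers(D)
            mask = rng.random(D) <= CR
            mask[ridx] = True
            trial = np.where(mask, mutant, pop[i])
            # Selection: the trial replaces the target if it improves f
            f_trial = f(trial)
            if f_trial <= fit[i]:
                pop[i], fit[i] = trial, f_trial
    best = int(np.argmin(fit))
    return pop[best], fit[best]

# Usage: minimize a 5-parameter sphere function over [-5, 5]^5
x_best, f_best = differential_evolution(lambda x: float(np.sum(x ** 2)),
                                        [(-5, 5)] * 5)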

2.7 RELATED WORK

This section presents works related to the cooling subsystem, sensitivity analysis, surrogate modeling, and optimization algorithms based on evolutionary strategies. Regarding the cooling subsystem, Callou et al. (CALLOU et al., 2012), (CALLOU et al., 2013) also develop models of a cooling subsystem using SPNs. Their subsystem components are not replicable in a straightforward manner, which may lead to problems related to scalability and state explosion (VALMARI, 1998). The authors evaluate five cooling architectures, reaching almost five nines (99.999%) of availability in the most redundant model. These works do not integrate the cooling and IT subsystems and do not perform sensitivity analysis.


Koo et al. (KOO; CHUNG; KIM, 2015) study the impact of an architecture with (n-k)-way CRACs, where n CRACs are active and k are in standby mode. The researchers create a transition diagram to represent the whole system behavior, in a structure similar to a Markov Chain. However, they analyze only CRAC availability, disregarding the other cooling components. The results show that the optimum allocation is obtained by a (3-3)-way CRAC arrangement, which reaches seven nines of availability with the smallest number of components. The researchers do not perform sensitivity analysis and do not evaluate the impacts of cooling on the IT components.

Souza et al. (SOUZA et al., 2013) evaluate the effects of temperature variation on IT systems. The authors suggest the Arrhenius equation to estimate the impact of cooling subsystem failures on IT equipment and analyze the cost of failures in terms of downtime and revenue loss. However, the authors do not perform a sensitivity analysis to identify the most critical components in the cooling and IT subsystems.

Alissa et al. (ALISSA et al., 2016) carry out experiments to provide insights concerning the cooling of IT components. The authors define the AU metric as the time between a cooling failure and the automatic power-off of servers due to overheating. The evaluations show that the AU is 21 minutes for DCs without a cold aisle. The work focuses on measurements, and therefore the authors do not undertake a sensitivity analysis and do not measure the overall system's availability.

In a previous work (GOMES et al., 2017), we evaluated the availability of a cooling subsystem and studied the impact of rotating CRACs instead of using a cold-standby strategy. That work uses a deterministic distribution to model the rotation of CRACs in a cooling subsystem, in which a CRAC turns off after a 12-hour period. It considers pipe and pump cooling components, which increases the complexity. The results, obtained by simulation, show little difference between the rotation and cold-standby strategies.

Some studies performed sensitivity analysis (SANTOS et al., 2017), (ANDRADE et al., 2017), (SILVA et al., 2018). The first models the IT environment through Petri Nets and performs a sensitivity analysis, identifying the edge router as the most sensitive component in a DC. The second applies two different sensitivity techniques (percentage difference and partial derivatives) in a disaster-recovery DC scenario using Markov Chains. The latter also uses Markov Chains in a disaster-recovery DC but applies only the partial derivative technique. None of these works considers cooling components in the DC environment.

Regarding evolutionary strategies, Dezani et al. (DEZANI et al., 2014) apply GA to a traffic light problem. The authors use Petri Nets as the fitness function to find the best route for each vehicle, aiming to decrease route time. The results outperformed the Dijkstra algorithm. The present work, in contrast to (DEZANI et al., 2014), uses a surrogate modeling approach as the model. Other authors (DELGARM et al., 2016) study the NSGA-II algorithm to improve the energy efficiency of buildings. Based on building orientation, window height, and location, the authors estimate the annual cooling and lighting electricity consumption and treat them as a multi-objective problem in order to find the optimal configuration of a building to be constructed.

Other authors (LIU; ZHANG; GIELEN, 2014) use surrogate modeling and evolutionary strategies to solve search problems. They use GP as the surrogate model in a low-dimensional space (about 20–50 variables) and apply DE as the optimization algorithm. Results show that the proposed solution outperforms other literature proposals that use GA for single-objective optimization (LIM et al., 2010).

Some works compare different surrogate models. In (YANG et al., 2016), the authors study RF and Boosted Regression Trees (BRT), a concept very close to GBM, as regression techniques to map areas with high carbon concentration in alpine ecosystems. Root Mean Squared Error (RMSE), the coefficient of determination (𝑅²), and Mean Absolute Error (MAE) are used to compare the solutions. The two strategies achieve very similar results, with a slight advantage for BRT in denser vegetation areas. On the other hand, (NAWAR; MOUAZEN, 2017) compares GBM, RF, and Artificial Neural Networks (ANN) in order to predict areas with nitrogen and carbon. Based on RMSE and 𝑅², RF outperformed the other surrogate models.

Table 5 compares the related works and the present one. In this work, both the availability evaluation and its maximization are studied. The aim is to present cooling models in a straightforward manner that need fewer states. Furthermore, it provides a joint evaluation of the cooling and IT subsystems concerning availability and the costs related to acquisition, operation, and downtime. In order to identify the most sensitive components, this work applies sensitivity analysis using the percentage difference technique to estimate how cooling components impact the DC availability.

Table 5 – Comparison among this and related work

Work | Modeling Approach | Cooling/IT Modeling | Sensitivity Analysis | Surrogate Models | Optimization Algorithms
-----|-------------------|---------------------|----------------------|------------------|------------------------
(CALLOU et al., 2012) | SPN | Cooling | No | No | No
(KOO; CHUNG; KIM, 2015) | RBD | Cooling | No | No | No
(SOUZA et al., 2013) | SPN | Cooling + IT | No | No | No
(ALISSA et al., 2016) | Measurement | Cooling + IT | No | No | No
(GOMES et al., 2017) | SPN | Cooling | No | No | No
(SANTOS et al., 2017) | SPN | IT | Percentage Difference | No | No
(ANDRADE et al., 2017) | Markov Chains | IT | Percentage Difference + Partial Derivative | No | No
(SILVA et al., 2018) | Markov Chains | IT | Partial Derivative | No | No
(DEZANI et al., 2014) | SPN | No | No | No | GA
(LIU; ZHANG; GIELEN, 2014) | Surrogate | No | No | GP | DE
(DELGARM et al., 2016) | Simulation | Cooling | No | No | NSGA-II
(YANG et al., 2016) | Surrogate | No | No | BRT, RF | No
(NAWAR; MOUAZEN, 2017) | Surrogate | No | No | RF, GBM, ANN | No
This work | SPN | Cooling + IT | Percentage Difference | GP, RF, GBM | GA, NSGA-II, DE


Regarding prediction and optimization, this work applies three surrogate models (GP, RF, and GBM) and compares them to evaluate the best fit through RMSE, 𝑅², and MAE. Afterward, three optimization strategies (GA, NSGA-II, and DE) are evaluated to achieve the best DC configuration under a limited budget in order to maximize the availability. Therefore, this work combines DC modeling using SPN with sensitivity analysis and optimization techniques based on surrogate models.
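A minimal sketch of such a comparison with scikit-learn is shown below, assuming the SPN evaluations were already collected into an input matrix X and an availability vector y; the synthetic data and hyperparameters are placeholders, not the actual experimental setup.

import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.model_selection import train_test_split

def compare_surrogates(X, y):
    """Fit GP, RF, and GBM on SPN samples and report RMSE, R2, and MAE."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=0)
    models = {
        "GP": GaussianProcessRegressor(),
        "RF": RandomForestRegressor(n_estimators=200, random_state=0),
        "GBM": GradientBoostingRegressor(random_state=0),
    }
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        pred = model.predict(X_te)
        rmse = np.sqrt(mean_squared_error(y_te, pred))
        print(f"{name}: RMSE={rmse:.4f}, "
              f"R2={r2_score(y_te, pred):.4f}, "
              f"MAE={mean_absolute_error(y_te, pred):.4f}")

# Hypothetical example with synthetic data standing in for SPN samples
rng = np.random.default_rng(0)
X = rng.random((500, 6))
y = X.sum(axis=1) + 0.01 * rng.standard_normal(500)
compare_surrogates(X, y)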

2.8 CONCLUDING REMARKS

This work applies several approaches. The modeling is done with SPNs due to their high abstraction level, which allows a fine-grained representation of the system. The focus of this work is modeling the cooling and IT subsystems in order to evaluate availability and identify critical components through sensitivity analysis. Finally, to maximize availability, an optimization algorithm based on an evolutionary strategy is applied; however, this requires solving the model thousands of times. To cope with this cost, the approach creates surrogate models based on the SPN model, building cheaper but trustworthy models that decrease the time of the optimization task. The LHS technique helps to select inputs that do not overlap each other, which generates varied outputs and increases the surrogate models' accuracy.
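As an illustration of this sampling step, SciPy's quasi-Monte Carlo module offers an LHS implementation; the parameter ranges below are hypothetical placeholders rather than values from the models of this work.

from scipy.stats import qmc

# Four hypothetical model parameters (e.g., failure/repair rates) with
# their lower and upper bounds.
lower = [0.001, 0.01, 0.5, 100]
upper = [0.010, 0.10, 2.0, 500]

sampler = qmc.LatinHypercube(d=len(lower), seed=0)
unit_samples = sampler.random(n=200)            # 200 points in [0, 1)^4
inputs = qmc.scale(unit_samples, lower, upper)  # stretch to the real ranges

# Each row of `inputs` is one non-overlapping configuration to feed into
# the SPN model, whose output is then used to train the surrogate.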

Referências
