Orchestration of cloud services with critical components in SKA

Universidade de Aveiro
Departamento de Eletrónica, Telecomunicações e Informática

2020

DZIANIS BARTASHEVICH

Orquestração de Serviços Cloud com Componentes Críticos no SKA

Orchestration of Cloud Services with Critical Components in SKA


“The greatest challenge to any thinker is stating the problem in a way that will allow a solution”

— Bertrand Russell


Dissertation presented to the Universidade de Aveiro in fulfilment of the requirements for the degree of Master in Computer Engineering (Mestre em Engenharia Informática), carried out under the scientific supervision of Doctor João Paulo Barraca, Assistant Professor at the Departamento de Eletrónica, Telecomunicações e Informática of the Universidade de Aveiro, and of Doctor Domingos Barbosa (co-supervisor), researcher at the Instituto de Telecomunicações of the Universidade de Aveiro.

This work was supported by ENGAGE SKA, funded by the Programa Operacional Competitividade e Internacionalização (COMPETE 2020) and FCT, Portugal.

o júri / the jury

presidente / president Prof. Doutor Joaquim Manuel Henriques de Sousa Pinto

Assistant Professor at the Universidade de Aveiro

vogais / examiners committee Doutor Nuno Pedro de Jesus Silva

Technical Manager, Critical Software

Prof. Doutor João Paulo Barraca


agradecimentos / acknowledgements

This work would not have been possible without the contribution and help of many people and entities, whom I must thank.

Firstly, I would like to thank Professor João Paulo Barraca, PhD, for giving me the opportunity to work in my field of interest, through which I was able to learn and grow immensely. I am grateful for his guidance and motivation throughout this year.

A big thanks to the rest of the research group, namely my co-supervisor Domingos Barbosa, PhD, and Professor Miguel Bergano, PhD, for providing a supportive environment in which I learned every day.

To my girlfriend, Inês Tavares, who went above and beyond to help me through the most difficult times and always encouraged me to be better, to not give up, and to not lose faith in myself.

To my Mother, for providing me with this wonderful opportunity to grow and have a purposeful life, and for being a role model and my biggest supporter. And to my grandparents: their pride and faith in me will never be forgotten and will always push me to be the best professional and person I can be.

This work was supported by Enabling Green E-science for the Square Kilometre Array Research Infrastructure (ENGAGE SKA), POCI-01-0145-FEDER-022217, and funded by Programa Operacional Competitividade e Internacionalização (COMPETE 2020) and FCT, Portugal.


Palavras Chave (Keywords) cloud computing, openstack, virtualization, sla, monitoring, automation, orchestration, availability methods, availability strategies, recovery methods.

Resumo (Abstract) This dissertation proposes high-availability methods for critical applications, so that they maintain their normal function and recover from unexpected failures. Applications can be developed and hosted in the cloud environment to gain flexibility in maintenance, which also offers the option of monitoring. A monitoring system can watch system metrics, such as CPU usage, or just a specific application service, whether it is running or not. In addition, creating alarms in the monitoring system makes it possible to trigger a notification when an unexpected event occurs, helping the orchestrator to recover from the critical state. A failure can occur when a given metric is above the established threshold, at which point the SLA (Service Level Agreement) is violated. The implemented and tested solution uses the OpenStack private cloud as infrastructure support and, through the Heat orchestrator, the TICK Stack monitoring system, and a recovery mechanism, provides a solution capable of monitoring application state and offering high availability. The test results showed that the solution is able to recover the service in different test scenarios, indicating the monitoring limits of the system, and to recover the service in an acceptable time without compromising other services.


Keywords cloud computing, openstack, virtualization, sla, monitoring, automatic deployment, orchestration, availability mechanisms, availability strategies, recovery methods.

Abstract This dissertation proposes high-availability methods that allow critical applications to maintain their normal function and recover from unexpected failures. Applications can be developed and deployed in the cloud environment to gain flexibility in maintenance, which also makes monitoring possible. A monitoring system can track system metrics such as CPU usage, or simply check whether a specific application service is running. Additionally, alarms created within the monitoring system trigger a notification when a failure event occurs, helping the orchestrator to fail over. A failure can occur when a certain metric rises above the established threshold, at which point the Service Level Agreement (SLA) is violated. The implemented and tested solution uses an OpenStack private cloud as infrastructure support and combines the Heat orchestrator, the TICK Stack monitoring system, and a recovery engine to monitor critical applications and provide high availability. The test results proved the worth of the solution in different test scenarios, indicated the monitoring limits of the system, and showed the service recovery time to be reasonable without compromising other services.


List of Figures

2.1 Cloud environment architecture
2.2 Comparison among three cloud service models
2.3 Hypervisors types
2.4 Hypervisor performance comparison [11]
3.1 Verifying host state at a certain monitoring interval
3.2 Swap memory usage by Nexus OSS
3.3 IOSTAT real-time disk statistics
3.4 Free memory tool in human-readable output
3.5 Content of file /proc/meminfo split through four images
3.6 Terminal output of mpstat tool of a virtual machine with 46 vCPU
3.7 netstat terminal output
3.8 iperf3 terminal output
3.9 Solution architecture
3.10 Service deployment workflow
3.11 Stack and SLA workflow from service deployment
3.12 Service recovery workflow
4.1 OpenStack module distribution
4.2 Service manager implementation architecture
4.3 Example of a Chronograf dashboard
4.4 Monitoring dashboard of SLA availability percentage
4.5 Monitoring dashboard of metric statistics
4.6 Class diagram of the implemented solution and related classes
4.7 Flow of the action inside the api_post() function
5.1 EngageSKA cluster at the Datacenter of Telecommunication Institute in Aveiro
5.2 Simultaneous service stack deployment
5.3 Kapacitor delay to process metrics
5.4 Duration of each step from service deployment
5.5 Deployment time over number of services
5.6 Detection time of a SLA violation among different monitoring intervals
5.7 Recovery method comparison
5.8 Service recovery time with different quantity of services
5.9 Throughput at monitoring system according to different monitoring intervals

List of Tables

2.1 Classes of system availability [14]
2.2 Comparison of network monitoring options [17]
2.3 Monitoring solution comparison
2.4 Comparison of migration techniques [22]
2.5 Orchestration system comparison
3.1 External SLA monitoring

Glossary

SKA Square Kilometer Array

SLA Service Level Agreement

EngageSKA ENAbling Green E-science for SKA

LMC Local Monitoring and Control

IaaS Infrastructure-as-a-Service

PaaS Platform-as-a-Service

SaaS Software-as-a-Service

AWS Amazon Web Services

HVAC Heating, Ventilation, and Air Conditioning

VM Virtual Machine

OS Operating System

KVM Kernel-based Virtual Machine

CPU Central Processing Unit

vCPU Virtual CPU

IO Input/Output

YAML YAML Ain’t Markup Language

HOT Heat Orchestration Template

SDP Science Data Processor

RAM Random Access Memory

TICK Telegraf, InfluxDB, Chronograf and Kapacitor

RTT Round Trip Time

DSL Domain Specific Language

PTP Precision Time Protocol

CI/CD Continuous Integration and Continuous Delivery

QoS Quality of Service

SNMP Simple Network Management Protocol

HA Highly-Available

FT Fault Tolerant

SMP-FT Symmetric for Multi-Processor Fault Tolerance

IOPS Input/output operations per second

BSOD Blue Screen Of Death

TTL Time-To-Live

WAN Wide Area Network

MTU Maximum Transmission Unit

UPS Uninterruptible Power Supply

NTP Network Time Protocol

HPC High-Performance Computing

RAID Redundant Array of Inexpensive Drives

PSU Power Supply Unit

NFS Network File System

CHAPTER 1

Introduction

This dissertation develops mechanisms for service deployment and high availability, using a monitoring system to keep the service working according to Square Kilometer Array (SKA) requirements. The SKA project uses Tango devices, which require special attention. A Tango device can control dishes, arrays of antennas, and more. The availability of those devices is critical, and the software that controls them cannot fail.

Moreover, the cloud environment allows projects like SKA to quickly provision software, either for test or production purposes. However, the cloud environment does not by itself provide a way to monitor the deployed software. In case of a failure, no alert is raised. Even if someone notices the failure, it still requires manual intervention. As a result, the software can suffer a high amount of downtime. In some scenarios, this can be critical. For example, if the emergency mechanisms fail to move the dish to a safe position under high wind speed, the dish may fall to the ground.

To achieve the desired objectives in this dissertation, we abstracted components such as the cloud environment and monitoring system by using OpenStack and TICK Stack, respectively.

1.1 Motivation

ENAbling Green E-science for SKA (EngageSKA) is a project funded by FCT/POCI (POCI-01-0145-FEDER-022217), P2020, and developed at the Telecommunication Institute located in Aveiro. The main goal of the project is to evaluate the state of the art with a sustainable plan for green e-science, by fostering infrastructure and participating in the ESFRI SKA project along the Big Data and Green Power axes. At the same time, it aims at securing contributions in the SKA consortia, with strong opportunities to participate in the construction and scientific exploration. The current aim of the project in the SKA consortia is to provide cloud computing infrastructure for scientific development and testing. Cloud computing enables faster and more flexible deployment and simplifies the datacenter architecture. Tasks, operations, and maintenance become much easier with cloud technologies because resources are centralized. Thanks to virtualization, a datacenter needs fewer but more powerful machines, so that more services, more testing, and more development can run on a single server.

1.2 Objectives

The main objective of this dissertation can be split into three parts. The first is to define a structure for the template file that describes the service and the SLA rules. The service description must include the resources to provision and the installation script. The SLA rules must describe the metrics to monitor, each with its threshold, together with the recovery methods associated with those metrics. The second objective is to analyze the template and deploy the service within the cloud environment using the orchestration mechanism. The third objective is to monitor the service continuously and, if one of the SLA metrics violates the established threshold, to trigger the recovery methods in the order specified in the template. If recovery fails, the user is notified.
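The three objectives can be illustrated with a minimal Python sketch. It is not the template format defined in this dissertation: the class names, field names, metric name, and recovery method names below are hypothetical, and a violation is assumed to mean that the metric rises above its threshold. The sketch only shows a service description paired with SLA rules whose recovery methods are tried in the listed order, with a notification when all of them fail.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SlaRule:
    """One monitored metric, its threshold, and ordered recovery methods (hypothetical fields)."""
    metric: str
    threshold: float
    recovery_methods: List[str] = field(default_factory=list)


@dataclass
class ServiceTemplate:
    """Service description plus SLA rules (hypothetical structure)."""
    name: str
    resources: Dict[str, object]
    install_script: str
    sla_rules: List[SlaRule] = field(default_factory=list)


def handle_sample(rule: SlaRule, value: float,
                  recover: Callable[[str], bool],
                  notify: Callable[[str], None]) -> str:
    """Check one metric sample against its rule.

    Returns "ok" when the threshold is respected, "recovered:<method>" when a
    recovery method succeeds, and "failed" after every method has been tried.
    """
    if value <= rule.threshold:
        return "ok"
    for method in rule.recovery_methods:  # tried in template order
        if recover(method):
            return f"recovered:{method}"
    notify(f"recovery failed for metric {rule.metric}")
    return "failed"


# Illustrative template; every value here is an assumption.
template = ServiceTemplate(
    name="example-service",
    resources={"vcpus": 2, "ram_mb": 4096},
    install_script="install.sh",
    sla_rules=[SlaRule("cpu_usage_percent", 90.0, ["restart_service", "rebuild_vm"])],
)
```

In this sketch `recover` and `notify` are injected callables, so the same rule-checking logic could be exercised without a real orchestrator or notification channel.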

1.3 Contributions

This dissertation has contributed to the development of the SKA project by providing the template specification and configuration for future services deployed by SKA members [1]. Another contribution is a state-of-the-art review of monitoring solutions for Local Monitoring and Control (LMC) [2]. Furthermore, this work supported the CI/CD mechanisms based on OpenStack infrastructure that are described in the paper presented at the ICALEPCS’19 conference [3].

1.4 Thesis structure

To familiarise the reader with the topics discussed in this dissertation, Chapter 2 presents the current state of the art, introducing the latest available technologies and their usability for the work developed here. Chapter 3 then presents the solution, explaining the architecture and the required components from a general point of view. Chapter 4 describes the technologies used and the methods developed to realize the architecture established in Chapter 3. The obtained results and the performed analysis are described in Chapter 5. Finally, Chapter 6 concludes the dissertation and presents future work. The Appendix and References come last.

CHAPTER 2

State of the Art

This chapter reviews the current technologies and methodologies related to this work, describing different solutions and showing the advantages and disadvantages of each. Some of the presented solutions are for academic purposes only, while others are open-source or commercial products currently used by the community. This chapter helps to better understand the available availability strategies, to create robust ones, and to choose the right tools for the implementation phase.

2.1 Cloud Computing

Cloud computing is a relatively new field that consists of providing resources such as computing power, storage space, and networking from a remote infrastructure to users on their own PCs, smartphones, or tablets [4]. Using a cloud environment simplifies the network infrastructure that small or large companies need to do their job, reducing the amount of equipment and the maintenance cost [4]. Cloud computing uses a distributed architecture in which resources are centralized and, through virtualization (described in Section 2.2), can be quickly provisioned on demand [5]. In cloud computing, the available resources are shared, as is the cost associated with them, so users only pay for what they need at a given time. Figure 2.1 illustrates a simple cloud computing architecture. Infrastructure is the hardware layer of cloud computing and consists of computing servers, networking, and storage units. Service is the software layer, which runs on top of the infrastructure. Management is responsible for coordinating the work among the different modules, such as the infrastructure and service layers. All modules in a cloud environment must have restricted access: only authorized users, verified by the security modules, should be able to access them. To use an application running on the cloud, the user's browser sends a request to the server. If the request is valid, the browser shows the received answer.

Figure 2.1: Cloud environment architecture.

Since resources can be shared, the same hardware can be used by multiple users, increasing the efficiency of resource usage [5]. This can be seen as an advantage, but also as a disadvantage if not handled properly. For example, if one user's system is compromised, other users can also be affected [6].

The deployment of a cloud environment can differ depending on the user requirements and purpose [7]. There are three cloud service models: Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS) and Software-as-a-Service (SaaS). IaaS consists of renting resources, such as processing, storage, and network capacity, providing users with ways to remotely run and develop applications [7]. With this service, hardware maintenance is not a concern for the user; only the Operating System (OS) configuration is their responsibility [4]. Amazon Web Services (AWS) and Windows Server are examples of this.

The PaaS model allows both hardware and OS abstraction, where developers only need to focus on application design, development, testing, implementation, and hosting. Examples of this cloud model include Google App Engine and Windows Azure.

The SaaS model is directed at common users, as it uses a browser as the platform to run a web application hosted on the cloud [7]. This is the most frequently used cloud model [4] and has several popular examples: social networks (Facebook), blogging platforms (Blogger), webmail services (Gmail), Internet banking portals, and online payment services. Figure 2.2 compares the three cloud service models previously described. Some applications make use of multiple cloud models, for example, the SkyDox web application (SaaS), which uses Engine Yard (PaaS) for document collaboration, which in turn is deployed on AWS (IaaS) [4].

Figure 2.2: Comparison among three cloud service models

Depending on the company's requirements, finances, and data privacy needs, there are several different cloud environments to choose from. A public cloud is available for general public usage and shares the same cloud resources among all clients [6]; the cloud is owned and managed by a third-party cloud provider [4]. In contrast, a private cloud is devoted to the internal usage of one company and might be hosted locally at the company or outsourced to a third party that manages it [6]. The private cloud brings higher data privacy and the advantage that resources are used only by people inside the company or by trusted members. Nonetheless, the costs of keeping a cloud running can be high in some cases. A cloud managed by a third-party cloud provider, on the other hand, releases the company from the responsibility of maintaining it, lowering the cost and letting the company focus on development. As a downside, public cloud resources are used by a variety of users, and data privacy cannot be fully assured. Some applications use both public and private clouds: the application is hosted on the public cloud, but sensitive data is stored on the private cloud for security purposes. This type of cloud environment is called hybrid. Another example is when the application is hosted on the public cloud, but the data are stored locally [4].

Cloud computing is often described as elastic or utility computing [6]. It follows a usage-based model in which a user only pays for what is needed. If more computational resources are required, they can be requested and acquired from the cloud [6]. Examples are Amazon S3 and Amazon EC2 [6]. As mentioned before, this type of model brings the advantage of lowering up-front investments, using on-demand service, and paying only for what is used. For Web applications, provisioning and de-provisioning bring the advantage of supporting 100 users at one moment and 10,000 users at another without wasting resources [6].

To better illustrate the utility model of cloud computing, consider the example presented in [6]: 100 servers are needed for 3 years. The first option is to lease the servers at approximately $0.40 per server-hour, for a total cost of

\[ 100~\text{servers} \times \$0.40/\text{hour} \times 3~\text{years} \times 8760~\text{hours/year} = \$1{,}051{,}200. \]

The second option is to buy the servers. Assume that each server costs approximately $1,500, that two staff members must be hired at $100,000 per year each to manage the servers, and that each server consumes 150 watts at a cost of $0.10 per kilowatt-hour, for a total of $13,140 per year in electricity. Summing all the associated costs gives

\[ 100~\text{servers} \times \$1{,}500 + 3~\text{years} \times \$13{,}140~\text{electricity/year} + 3~\text{years} \times 2~\text{staff} \times \$100{,}000~\text{salary/year} = \$789{,}420. \]

In conclusion, leasing the servers is somewhat more costly than owning them. However, this comparison leaves out further costs of server ownership, such as space rental, Heating, Ventilation, and Air Conditioning (HVAC), hardware upgrades, and repairs.
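The lease-versus-buy arithmetic can be reproduced with a short Python sketch. The figures are the ones quoted from [6]; the function and parameter names are illustrative only, and `Decimal` is used so that the dollar amounts are computed exactly.

```python
from decimal import Decimal

HOURS_PER_YEAR = 8760
SERVERS = 100
YEARS = 3


def lease_cost(rate_per_hour=Decimal("0.40")):
    """Total cost of leasing all servers for the whole period."""
    return SERVERS * rate_per_hour * YEARS * HOURS_PER_YEAR


def electricity_per_year(watts=150, price_per_kwh=Decimal("0.10")):
    """Yearly electricity bill for all servers running continuously."""
    kwh_per_server = Decimal(watts) * HOURS_PER_YEAR / 1000
    return SERVERS * kwh_per_server * price_per_kwh


def purchase_cost(server_price=Decimal("1500"), staff=2, salary=Decimal("100000")):
    """Hardware plus electricity plus staff salaries over the whole period."""
    hardware = SERVERS * server_price
    electricity = YEARS * electricity_per_year()
    salaries = YEARS * staff * salary
    return hardware + electricity + salaries


print("lease:   ", lease_cost())
print("purchase:", purchase_cost())
```

Keeping each cost term in its own function makes it easy to vary a single assumption, for example the hourly lease rate, and see how the comparison shifts.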

2.2 Virtualization

Virtualization has changed the architecture of the cloud environment as it was previously known, in which a single computational server was used exclusively to run a single OS. Virtualization enables a single physical machine to host multiple Virtual Machines (VMs) that execute multiple applications and services. This allows multiple different OSs to run, isolated from each other, on the same physical machine, abstracting from the infrastructure level [5]. One reason to adopt this technology is that it keeps up with the growth of large information transfers and increases datacenter capabilities [8].

Figure 2.3: Hypervisors types

To achieve virtualization, it is required to have a hypervisor present on the physical machine, and there are two main types of hypervisors: Bare-metal (Type-1) Hypervisor, and Hosted (Type-2) Hypervisor.

The bare-metal hypervisor runs directly on top of the hardware and does not require an OS. Its disadvantage is that it requires a specific hardware configuration and therefore lacks hardware flexibility.

On the other hand, the hosted hypervisor allows for a wide variety of hardware configurations, as it runs on top of the OS. Yet, as shown in Figure 2.3, it has an extra layer compared to the bare-metal type and is thus less efficient.

The virtualization layer itself has three possible types: Full Virtualization, Paravirtualization, and OS Level Virtualization. Full Virtualization has direct access to the resources of the physical machine, allowing quicker resource access. Paravirtualization is a type of virtualization in which the OS is aware of being executed inside a VM: instead of making direct request operations as in full virtualization, it performs hypercalls, which are explicit calls to the hypervisor.

OS Level Virtualization does not require a hypervisor, because the OS plays the role of the hypervisor. This type of virtualization forces all virtual machines to share the same OS and is usually used for containerization.

Containerization is an alternative to VMs that gives developers the ability to increase their productivity [9]. Containers create an abstraction layer to deploy services quickly and can be started in a matter of seconds, much faster than a VM, which can take minutes. On the downside, as said before, all containers must share the same OS, resulting in poorer isolation than with VMs and in limited OS compatibility [9].

The use of virtualization improves resource usage and quickly provides isolated environments for testing purposes, increasing flexibility and utility. It also removes the dependency of the OS level on the hardware level, since the same hypervisor can be executed on different OS [10].

Hwang et al. [11] presented a comparison of the four most popular virtualization platforms: Hyper-V, Kernel-based Virtual Machine (KVM), vSphere, and Xen. Their methodology for comparing hypervisor performance was to run specific benchmark workloads for each resource with 1 Virtual CPU (vCPU) and with 4 vCPUs. The Bytemark benchmark was used to stress the Central Processing Unit (CPU) and showed similar performance among the hypervisors, with only minor differences. The Ramspeed¹ benchmark was used to measure cache and memory bandwidth. It showed similar performance as well; however, KVM with multiple vCPUs showed a 25% performance decrease compared to the other hypervisors. The Bonnie++² benchmark was used to test disk throughput and revealed similar performance, except for Xen, which showed a decrease in performance in the test with character-level Input/Output (IO). They ran an additional benchmark, FileBench³, which confirmed the decrease in performance for Xen. In this benchmark, KVM showed the best performance of all. The Netperf⁴ benchmark was used to test network performance and showed similar results across most hypervisors, except for Xen, which performed 22% worse than the others.

In their discussion, they stated that there is no perfect hypervisor choice. Figure 2.4 presents a comparison of the hypervisors at different levels, where the best performance belongs to vSphere and KVM. Different applications could benefit from particular hypervisors but, in general, vSphere had the best performance in most benchmarks.

Figure 2.4: Hypervisor performance comparison [11]

¹ https://github.com/cruvolo/ramspeed-smp/
² https://linux.die.net/man/8/bonnie++/
³ https://linux.die.net/man/1/filebench/
⁴ https://linux.die.net/man/1/netperf/

2.3 Service Level Agreement and Availability Metrics

The need of clients to use cloud computing is increasing on a worldwide scale, creating demand for differentiated service quality. Different clients have different expectations and requirements, forcing cloud providers to establish a Service Level Agreement (SLA), a commitment between the provider and the client that assures an agreed quality of service.

The cloud consumer establishes a set of requirements according to the cloud capabilities and available metrics, resulting in an agreement. If the cloud service provider does not comply with the established SLA, penalties may apply. For example, Salman Baset [12] compared the SLAs of different cloud providers, where Amazon EC2 pays per second 10% of the customer bill if the availability is below 99.95%, and Azure Compute pays 10% if the availability is below 99.95%, or 25% if the availability is below 99%.
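A tiered penalty of this kind can be expressed as a small lookup. The sketch below uses the Azure Compute figures quoted above from [12]; the function name and the tier representation are assumptions made for illustration and do not reflect any provider's current terms or API.

```python
def service_credit_percent(availability, tiers):
    """Return the credit (% of the bill) owed for a measured availability.

    `tiers` is a list of (availability_floor, credit_percent) pairs. A tier
    applies when the measured availability is below its floor, and the
    largest applicable credit wins.
    """
    applicable = [credit for floor, credit in tiers if availability < floor]
    return max(applicable, default=0)


# Tiers as quoted in the text for Azure Compute [12].
AZURE_COMPUTE_TIERS = [(99.95, 10), (99.0, 25)]

for measured in (99.99, 99.5, 98.0):
    credit = service_credit_percent(measured, AZURE_COMPUTE_TIERS)
    print(f"availability {measured}% -> credit {credit}%")
```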

To ensure that the established set of requirements is being fulfilled, cloud providers must have a monitoring system that collects metrics periodically. The monitoring system can use these metrics to verify VM state, performance, and other information. Service performance can be differentiated using Quality of Service (QoS). To establish a differentiated QoS, the monitoring system uses different thresholds according to the clients' requirements. One physical server can provision multiple VMs, which demand more resources from the server and increase its load, possibly leading to resource shortage. This creates the need to monitor the resource utilization of each server and to balance the load across all available servers [13].

Resource usage is not the only metric to be monitored, as service availability and performance are also relevant. Some of the important metrics to monitor on a server to ensure the performance and availability are: Pages/sec, Number of CPUs, Guest and Host Memory Usage, Memory Swap In/Out Usage, VM Disk Read and write rate, VM Network Data Receive and Transmit Rate, Virtual Machine Configuration, Virtual Machine State, Response Time, VM Startup and Release Time, Up Time, CPU Usage, Page Faults/Sec, Available Memory, among others [13].

Due to hardware evolution, availability has improved since 1980, when a typical server had an availability of 99% [14]. While this level of availability may sound high, it results in 100 minutes of downtime per week. This kind of downtime can be acceptable for back-office computers where the work is done asynchronously. However, critical and online applications cannot undergo such downtime. For every moment the service is unavailable, the business can be losing profit, or the client experience is ruined and, consequently, the company's reputation. Ideally, this type of application requires at least high-availability, which is 99.999% of availability, resulting in a maximum of 5 minutes of service denial per year, roughly 6 seconds per week. An excellent example of high-availability is a telephone network that requires availability of class 5 (five nines, 99.999%), resulting in a maximum of 2 hours of downtime within 40 years. Furthermore, Table 2.1 describes other availability classes with the respective categories to which they apply.

System Type             Unavailability (minutes/year)  Availability (percentage)  Availability Class  Category
Unmanaged               50,000.00                      90                         1                   Personal clients
Managed                 5,000.00                       99                         2                   Entry-level business systems
Well-managed            500.00                         99.9                       3                   E-commerce
Fault-tolerant          50.00                          99.99                      4                   Datacenter
High-availability       5.00                           99.999                     5                   Telephone network
Very-high-availability  0.50                           99.9999                    6                   Military defense systems

Table 2.1: Classes of system availability [14]
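The downtime figures quoted for each availability level follow from simple arithmetic: the unavailable fraction multiplied by the length of the period. The following Python sketch makes that conversion explicit; the function name is illustrative, and a 365-day year is assumed, so the results are slightly larger than the rounded values listed in Table 2.1.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, assuming a 365-day year
MINUTES_PER_WEEK = 7 * 24 * 60    # 10,080


def downtime_minutes(availability_percent, period_minutes):
    """Maximum downtime allowed in a period for a given availability."""
    return (100.0 - availability_percent) / 100.0 * period_minutes


for availability in (90, 99, 99.9, 99.99, 99.999, 99.9999):
    per_year = downtime_minutes(availability, MINUTES_PER_YEAR)
    per_week = downtime_minutes(availability, MINUTES_PER_WEEK)
    print(f"{availability}%: {per_year:.2f} min/year, {per_week * 60:.1f} s/week")
```

For 99% availability this gives about 100.8 minutes per week, and for 99.999% about 5.26 minutes per year, in line with the orders of magnitude in the table.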

Buyya et al. [15] proposed a dynamic SLA-oriented solution built using Manjrasoft Aneka [16]. With it, a deadline is established for a given task, with the possibility of allocating extra resources to meet it. The experiment had one task with four different deadlines: 1 hour, 45 minutes, 30 minutes, and 15 minutes. It started with 4 static machines and, when the job began, executed an Aneka provisioning algorithm that performed cost optimization by calculating the minimum amount of resources required to execute the task in the predefined time. If more resources were required to finish the job on time, it could allocate additional VMs at an extra cost.

As a result, for the 1-hour deadline no extra resources were required, and the task finished in 1:00:58. For the 45-minute deadline, 2 extra VMs were required at a total cost of US$ 0.17, and the task finished in 0:41:06. For the 30-minute deadline, 6 extra VMs were required at a total cost of US$ 0.51, and the task finished in 0:28:24. For the 15-minute deadline, 20 extra VMs were required at a total cost of US$ 1.70, and the task finished in 0:14:18. The experiment shows that if a task requires a shorter deadline, it is possible to provision extra resources and divide the task across them to finish earlier. The more resources are used, the faster the result and the higher the cost. This method can be used to provide extra resources during usage peaks and to load-balance the work across the resources.

2.4 Monitoring

A monitoring system has the purpose of monitoring the host and issuing alerts about changes. The main components of the system are the collector, storage, presentation, and alerting. There are two methods to retrieve the monitoring metrics from the host: push and pull [17]. The metrics retrieved from the monitored hosts are stored and processed at the collector node. It is essential to choose an adequate method to store the monitoring data, since the right storage method can improve performance and scalability. Time-series databases are commonly used in monitoring systems because they scale quickly and have better usability [17]. The stored metrics can then be visualized on a dashboard in the form of graphs, tables, and more. The presentation helps administrators to analyze historical events and detect abnormalities in the network or host behavior. It can also display real-time statistics [17].

Angelopoulos et al. [18] presented a visualization software called Grafana. It is an open-source tool for visualizing metrics and time series that also supports querying and multi-tenant dashboards. It can also generate notifications when a specific metric exceeds the defined threshold. A typical example of notification generation is when the host becomes unreachable or unresponsive. Notifications should be handled intelligently: when a specific metric bounces around the threshold, the notifications should be aggregated and sent only once, avoiding notification spamming [17].
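One simple way to avoid notification spamming is a cooldown window: after a notification is sent, further threshold crossings are suppressed until the window has elapsed. The Python sketch below illustrates that idea only; it is an assumed strategy with hypothetical names, not the mechanism used by Grafana or described in [17].

```python
class AlertAggregator:
    """Suppress repeated notifications while a metric bounces around its threshold."""

    def __init__(self, threshold, cooldown_seconds):
        self.threshold = threshold
        self.cooldown_seconds = cooldown_seconds
        self._last_sent = None  # timestamp of the last notification sent

    def observe(self, timestamp, value):
        """Return True when a notification should be sent for this sample."""
        if value <= self.threshold:
            return False
        in_cooldown = (
            self._last_sent is not None
            and timestamp - self._last_sent < self.cooldown_seconds
        )
        if in_cooldown:
            return False  # aggregated into the notification already sent
        self._last_sent = timestamp
        return True


# A metric flapping around a threshold of 90 with a 60-second cooldown.
aggregator = AlertAggregator(threshold=90, cooldown_seconds=60)
samples = [(0, 95), (10, 85), (20, 96), (30, 89), (40, 97), (70, 99)]
decisions = [aggregator.observe(t, v) for t, v in samples]
print(decisions)
```

The cooldown length trades alert latency for noise: a longer window sends fewer messages but may delay the report of a second, unrelated violation.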

Other purposes of monitoring include preventing security breaches, detecting suspicious events, and ensuring that the QoS complies with the established SLA. To keep up with cloud growth, the monitoring system also has to become more complex, requiring flexibility and scalability to keep costs low. Monitoring systems can add overhead to the network, resulting in a loss of performance, an extra cost that depends on the number of monitoring points, and greater maintenance difficulty.

Monitoring data can be gathered in two ways: agent-based and agentless. Table 2.2 shows a short comparison between the two types [17].

| Feature comparison | Agentless | Agent-based |
|---|---|---|
| Deployment | Easy | Hard |
| Security | Good | Better |
| Network overhead | Yes | Less |
| Breadth and depth of monitoring | Limited | Extensive |

Table 2.2: Comparison of network monitoring options [17]

An agent-based monitoring system relies on third-party software deployed on every host under monitoring. Such a system is difficult to implement and maintain, as every newly added host requires an individual installation of the agent. If the configuration of the monitoring system changes in the future, changes will have to be made on each host. Nevertheless, it can use authentication protocols and transmits only the necessary data, lowering the bandwidth used in the network. To completely remove the network overhead, the monitoring traffic must be separated from the general network by adding another network exclusively dedicated to monitoring. However, this causes scalability issues, as it adds complexity when new hosts are added and requires more maintenance [17].

The agentless monitoring system is easy to implement and does not need to be installed on every host. It uses a well-known protocol, Simple Network Management Protocol (SNMP), to push the data to the network so that the collector host can retrieve it. The collector host is a monitoring service responsible for gathering information from the network and storing it. The data sent by the host under monitoring is not filtered and may include irrelevant information, creating network overhead and higher bandwidth usage. This type of monitoring also has security issues. Agentless monitoring allows remote access to the server using SNMP, enabling the user not only to retrieve the server's performance information but also to obtain management access and perform actions such as a reboot. Authentication protocols exist, but they are not as effective as the ones used by an agent-based system [17].

As previously stated, monitoring information can be gathered through pull or push actions. In the pull action, the collector asks the host for specific metrics at a certain interval of time. In the push action, the host sends metrics to the collector at a certain interval of time or when an event occurs. The push method can keep the network overhead low if used properly [17].
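The difference between the two actions is mainly who initiates the transfer. The following in-memory Python sketch illustrates this under simplifying assumptions: there is no network, the `Host` and `Collector` classes are hypothetical, and the CPU value is a random stand-in for a real probe.

```python
import random

class Host:
    def __init__(self, name):
        self.name = name

    def read_metrics(self):
        # Stand-in for real probes (CPU, memory, ...)
        return {"host": self.name, "cpu": random.uniform(0, 100)}

class Collector:
    def __init__(self):
        self.store = []

    def pull(self, hosts):
        # Pull: the collector asks each host for its metrics
        for host in hosts:
            self.store.append(host.read_metrics())

    def receive(self, sample):
        # Push: the host decides when to send; the collector only receives
        self.store.append(sample)

hosts = [Host("node-1"), Host("node-2")]
collector = Collector()
collector.pull(hosts)                       # one pull cycle over all hosts
collector.receive(hosts[0].read_metrics())  # one sample pushed by node-1
print(len(collector.store))
```

With pull, the polling interval is controlled centrally at the collector; with push, each host controls when and what it sends, which is why it can reduce traffic when samples are only sent on relevant events.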

2.4.1 Related monitoring solutions

Rodrigues et al. [19] presented an overview of generic monitoring solutions such as Cacti, MRTG, and Nagios, and compared them. Cacti and MRTG are used to measure network link consumption, but neither of them provides a self-configuration method or support for host discovery. Nagios was designed for traditional environments; its main feature is support for plugins that gather multiple metrics, which adds flexibility and allows monitoring in virtually any type of environment. As a downside, Cacti, MRTG, and Nagios do not provide elasticity and do not support self-configuration. Some of the solutions presented in [19] focus on specific software, such as Amazon CloudWatch, the monitoring system for Amazon Web Services (AWS). Unlike the previously described solutions, this one has a self-configuration method, allowing AWS users to configure their clouds easily. The downside of this solution is that it only works with AWS products.

Angelopoulos et al. [18] presented possible monitoring solutions within the 5G environment. They presented the advantages of generic solutions such as Zabbix, Nagios, Sensu, and Consul. Zabbix is an enterprise-level distributed monitoring system for both network and software applications. It uses an agent-based collection method and is capable of reporting through push or pull. Its downside is that it does not use a time-series database, produces many false positives, and triggers false alarms. Nagios has numerous add-ons and provides features like multi-tenancy. However, scalability is a big issue in this monitoring system, and health checks are difficult to manage in large infrastructure environments [19]. Sensu supports automatic host discovery and processes metrics using a queuing system. Its scalability is superior to that of Nagios, but it suffers from a single point of failure. Consul performs better health checks than Nagios by using event-based push messages, reducing the network overhead and computational resource allocation. This solution is more scalable than the previous three, but it has the disadvantage that agent availability is not monitored. Another proposed monitoring solution is Monasca, which is integrated within the OpenStack platform. Monasca offers multi-tenancy, scalability, excellent performance, and fault-tolerant monitoring-as-a-service communication through a REST API. Additionally, it has an alarm and notification engine.

The control and monitoring of the scientific data provided by the SKA telescope is one of the main challenges of the SDP [20]. The SDP consortium from SKA is expected to invest in OpenStack, and Monasca is an option as a monitoring tool because it has direct integration with OpenStack. Yet, as a downside, it creates unnecessary network overhead, increasing the possibility of losing critical information, and it reduces compatibility with non-Monasca collecting agents, thus requiring higher investment in the integration of custom collection agents. Besides OpenStack Monasca, Collectd, Graylog, Prometheus, and ELK Stack were considered as monitoring framework options. Collectd is an old metric collection service with numerous integration plugins for specific applications, but it lacks a metric visualization platform and a log collection service. Graylog is a tool for log collection and analysis that uses ElasticSearch for storage and Fluentd for log collection, but it lacks monitoring features because it mainly targets log analysis. Prometheus is a monitoring system featuring metric and alert solutions. It performs adequately when used for container monitoring and can be integrated with Grafana for information analysis and visualization, but it lacks logging solutions. ELK Stack offers monitoring and logging solutions based on ElasticSearch for storage, LogStash for log and metric collection, and Kibana for visualization and analysis⁵.

Another option for the SKA SDP is to combine ELK Stack with Prometheus, which can provide a full monitoring and logging solution but has the disadvantage of not supporting multi-tenancy. Also, Prometheus lacks a metric push method, which would be useful for specific applications. It is also stated that the Monasca Agent is a metric collection method designed to work with Python 2 and outside of a container. Thus, problems occurred when trying to run the agent on the more recent Python 3, and because the Monasca agent plugins use detection routing that is incompatible with containers by default. The detection routing had to be reconfigured manually in order to work properly⁶.

M. Brattstrom, P. Morreale et al. [17] reviewed the InfluxData Platform solution, also known as TICK Stack. This solution is agent-based, consists mostly of open-source code, and comes with all the tools necessary for system monitoring. The TICK Stack is composed of four components: Telegraf, InfluxDB, Chronograf, and Kapacitor (TICK). Telegraf is a plugin-driven server agent for collecting and reporting metrics, InfluxDB is a time-series database for metric storage that uses SQL-like queries to interact with the collected data, Chronograf is a platform for data presentation, and Kapacitor is a data processing engine for alert management.

Of all the presented and reviewed monitoring solutions, only OpenStack Monasca, Nagios, ELK Stack, Zabbix, and TICK Stack were selected as promising, and Table 2.3 compares their main features. The TICK Stack implementation has the lowest complexity of them all due to its modular and straightforward components, and its plugin support adds further value. Except for Monasca, the remaining monitoring solutions presented in Table 2.3 also support plugins. With Nagios, monitoring data can be collected both agentless and agent-based; the remaining solutions only support agent-based collection. Using a time-series database is helpful for predicting future events by analyzing past behavior; Monasca and TICK Stack use one, while the others do not. Multi-tenancy is possible in Nagios as a paid feature; in ELK Stack it is not available, and the rest support it. Scalability is an essential feature in a cloud environment. Unfortunately, Nagios and Zabbix do not support it. Additional monitoring can be performed by reading log files, which only ELK Stack, Zabbix, and Nagios support. However, the logging feature in Nagios comes at an additional cost. In summary, there is no single perfect monitoring solution; the choice depends on the setup, the environment, and the requirements.

⁵ http://ska-sdp.org/sites/default/files/attachments/sdp_memo_053_-_monitoring_and_logging_for_the_sdp_part_1_-_signed.pdf


| Features | Monasca [18][20] | Nagios [19][18] | ELK Stack [20] | Zabbix [18] | TICK Stack [17] |
|---|---|---|---|---|---|
| Complexity | Medium | Medium | High | Medium | Low |
| Plugin support | No | Yes | Yes | Yes | Yes |
| Open-source | Yes | Yes | Yes | Yes | Yes |
| Agent/Agentless | Agent-based | Both | Agent-based | Agent-based | Agent-based |
| Uses time-series | Yes | No | No | No | Yes |
| Multi-tenancy | Yes | Paid | No | Yes | No |
| Scalability | Yes | No | Yes | No | Yes |
| Logging | No | Paid | Yes | Yes | No |

Table 2.3: Monitoring solution comparison

2.5 Automatic Deployment

A wide variety of OSs can be run, and it is impossible for cloud administrators to manually deploy a massive number of virtual machines in a short amount of time. Therefore, orchestrator components were developed to assist humans in this task. The orchestrator is responsible for assigning cloud resources, such as volume, memory, and computing resources, to each VM instance [21].

Although the orchestrator can provision resources, it requires the user to specify resources to provision in the form of a template. The orchestrator’s purpose is not only to provision virtual resources, but also to react to unexpected failures, and perform scheduled maintenance. As a fundamental feature, an orchestrator can perform VM migration, which consists of moving one VM from one host to another. As all the resources are virtual, it is easy to perform the migration, and there are two methods: live migration and cold migration [22].

Live migration occurs while the VM is still running. This method is usually used in load-balancing situations. To migrate the VM from one host to another, a set of checks must be made. During migration, new resources are provisioned at the new host and, most importantly, they must match the original specifications before the OS migration starts. Once all checks have passed, the OS is migrated and the network traffic is redirected to the new VM; the old instance is notified of the successful migration and removed. During live migration, users are not aware of any change, although there is a momentary and insignificant increase in latency [22].

Cold migration occurs while the VM is powered OFF. Since the VM content is stored in a file or volume, it can be easily migrated from one location to another. The new VM instance is then associated with the new host and, after the migration, the old instance is deleted.

| Migration | VM state | Advantages |
|---|---|---|
| Live Migration | Powered ON | quick migration; facilitates cloud maintenance |
| Cold Migration | Powered OFF | simple architecture and implementation; shared storage not required |

Table 2.4: Comparison of migration techniques [22]

When a host requires maintenance, the operator needs to migrate its VMs to another host. With live migration, the VM remains powered ON and is moved to another host without downtime, so the client's service is not affected. With cold migration, the VM is powered OFF and stored in a file or volume that can be easily transferred to another host. This method has a simpler architecture and is easier to implement. Table 2.4 summarizes the advantages of the two migration methods explained above [22].

2.5.1 Related automated deployment solutions

Kovács József and Péter Kacsuk [23] presented Occopus, a multi-cloud orchestrator that deploys and manages complex scientific infrastructures. Occopus has features such as multi-cloud support, support for multiple configuration management tools, health monitoring, multiple node definition, scaling, on-the-fly dynamic reconfiguration of the infrastructure, interfaces, and error reporting support. They also reviewed some academic orchestration prototypes (Roboconf, Live Cloud, SALSA, IM by GryCAP) and commercial orchestration prototypes (Cloudify, Heat, CloudFormation, Terraform). Roboconf [24] is a cloud orchestrator that supports service deployment, maintenance, and migration between cloud and multi-cloud systems. However, Roboconf lacks support for configuration management tools.

Live Cloud [25] is a management framework that provides service and resource provisioning. This orchestrator lacks multi-cloud support.

SALSA [26] is a framework for dynamic infrastructure provisioning. It supports not only single-cloud but also multi-cloud systems. It uses its own configuration management tool, providing fine-grained configuration at different levels. However, this orchestrator consists of many services, creating a complex system (in terms of architecture and usage) that requires more human effort to configure and keep up to date. In contrast to SALSA, Occopus uses third-party configuration management tools, allowing the user to choose the most convenient and adequate configuration for the application.


Infrastructure Manager (IM) by GryCAP [27] is closer to Occopus than the others, providing a user-friendly interface by hiding irrelevant details from the user and focusing on Ansible as a configuration management tool. However, it is not flexible enough for users who may prefer other configuration management tools, such as Chef⁷. Also, IM contains cloud-specific details/attributes, eliminating multi-cloud portability.

Cloudify⁸ was released in 2014, making it one of the latest orchestrators. It allows the user to set up a life cycle for services and applications, including monitoring of all details of the application, detecting problems, and automatically fixing them. Nevertheless, some advanced features, such as the Web UI, are only available in the premium edition.

Heat⁹ is a template-based orchestrator that provides auto-scaling if integrated with the Telemetry module and allows configuration management tools such as Chef or Puppet. Interaction with Heat can be done via CLI, API, or the Horizon Dashboard. However, Heat only supports OpenStack clouds.

CloudFormation¹⁰ is not an open-source solution, but it is the most mature and heavyweight orchestrator in comparison to the previous ones. This orchestrator was developed by Amazon exclusively for their AWS clouds.

Terraform¹¹ is an open-source solution that focuses only on infrastructure deployment. There is no lifecycle management, scaling, or error handling. There is no UI; only the CLI can be used, and the learning curve is much steeper than that of other orchestrators. It was not designed for cloud portability, making it significantly difficult to move to another cloud.

Orchestrators can also be used at the container level. The most commonly used are Swarm¹², Kubernetes¹³, and Apache Mesos¹⁴. However, they only operate on already existing resources; resource allocation (containers) is not yet supported [23].

The orchestrator can be locked in to a particular configuration management tool to install and configure services on the deployed resources.

Hochgeschwender et al. [28] compared existing configuration management tools. These tools are mainly used to automate development and system administration tasks such as deployment, testing, and maintenance [23]. The most popular and mature are Chef and Ansible¹⁵. Other analyzed tools are Salt¹⁶, Puppet¹⁷, and roslaunch¹⁸. The main difference between them is whether an agent is required to operate. Ansible and roslaunch use an agentless architecture, using an SSH session to access the deployment environment and execute commands. To install software, Ansible uses the environment's package manager or SCP to copy files. Chef, Salt, and Puppet use an agent-based architecture, with a master node for control and a client daemon running on the deployment environment to execute commands. Compared to agent-based configuration management tools, agentless tools do not require an agent to be installed on the host; they only require an SSH connection, which makes this type of tool the most popular.

⁷ https://www.chef.io/
⁸ https://cloudify.co/
⁹ https://wiki.openstack.org/wiki/Heat/
¹⁰ https://aws.amazon.com/cloudformation/
¹¹ https://www.terraform.io/
¹² https://docs.docker.com/engine/swarm/swarm-tutorial/
¹³ https://kubernetes.io/
¹⁴ http://mesos.apache.org/
¹⁵ https://www.ansible.com/
¹⁶ https://www.saltstack.com/
¹⁷ https://puppet.com/

2.6 Availability Mechanisms

Availability mechanisms are essential for overall system availability. Without them, the system will stop working upon an error, resulting in a service failure, which for some businesses can mean a loss of profit. A generic strategy for dealing with errors is manual intervention, where the system administrator tries to figure out what happened and how to fix the problem. This strategy results in significant downtime and makes the service unreliable. Other, more automated mechanisms that deal with errors without human interaction are required.

Moreover, in order to keep a system available, it is essential to know how and where it could fail. Nabi et al. [29] presented the state of the art in cloud availability, where they reviewed different types of failures at different levels. Those levels are:

• power failure: loss of energy given to an infrastructure, causing failure of the devices such as networking, computing or storage

• hardware failure: loss of main or secondary system components such as CPU, memory, disk, ventilation and more

• network failure: total or partial networking failure of devices such as a router, switch, firewall, or a virtual network function

• VMM or hypervisor failure: VM manager or hypervisor failure will translate into failure of the virtual instances associated with it

• VM failure: the OS running on top of the VM can fail, leading to unavailability of the VM

• application failure: the application itself can malfunction or fail due to individual component failure

Not all of the failure types fall within the cloud provider's scope. For example, the IaaS cloud provider takes responsibility for power, hardware, network, hypervisor, and VM availability, but the application level is out of its scope, and the user is responsible for the components residing at that level. To prevent failures, the cloud provider can use mechanisms to increase availability. Those availability mechanisms can be categorized into three groups: fault tolerance, protective redundancy, and overload protection mechanisms.

Fault tolerance means that, through certain actions, service availability suffers little to no downtime. Those actions can consist of failover/switchover, where the failure is handled by redirecting the workload from the failed node to a redundant healthy node [30]. Another action could be restarting the failed node, which clears its failed state and returns it to an initial condition. Rollback and roll-forward consist of switching from the failed state to a state that is known to be healthy and correct, such as a backup or snapshot.

Protective redundancy consists of having redundant elements that are not required when the system is working correctly. The redundant element, which enables the service to work correctly when the active node fails, can also be referred to as a standby node. There are two types of standby nodes: Hot Standby and Cold Standby. A Hot Standby node is aware of the current state of the active node and, in case of failure, replaces it with little to no downtime. A Cold Standby node, also known as a Spare, is a redundant element that can be instantiated or uninstantiated on demand. Since a Cold Standby node does not track the state of the active node (as it is shut off), it requires some time before it can replace the failed node, which translates into additional downtime. However, the spare node is not powered on, so it uses fewer resources and lowers energy consumption. This type of standby is more effective for applications that are independent of the current state, in other words, stateless applications.

By combining different available methods and strategies, different levels of availability can be achieved. For instance, a geographical redundancy model can protect services even against natural disasters, such as flooding, earthquakes, and hurricanes, by distributing the services across different geographical sites [31]. Having multiple computational locations avoids a single point of failure, speeds up the response to requests, and distributes the workload among them. As an example, Google uses clusters of servers in a distributed architecture across the world, resulting in high performance, high availability, high throughput, and the ability to manage a significant number of requests from any part of the world.

Overload protection consists of protecting the system against exceeding component limits by means of auto-scaling and load-balancing. Auto-scaling, also referred to as elasticity, is a mechanism that provisions and de-provisions resources according to a scheduled demand or when a specific workload threshold is reached. Auto-scaling can work side by side with the load-balancer in order to balance the workload between all nodes while optimizing resource usage [32][33][34]. A load-balancer is a mechanism used to distribute the workload across the available nodes. When a node is highly loaded, the excess workload is transferred to another node that is lightly loaded, ensuring that every node has an equal amount of work [35].
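The interplay between load-balancing and auto-scaling can be sketched in a few lines of Python. This is only an illustration under assumed values: the `dispatch` function, the node names, and the capacity of 100 load units are hypothetical and do not correspond to any specific load-balancer in [32]-[35].

```python
def dispatch(loads, job, capacity):
    """Assign a job to the least-loaded node; scale out if it would overflow."""
    target = min(loads, key=loads.get)
    if loads[target] + job > capacity:
        # Auto-scaling: provision an extra node (hypothetical naming scheme)
        target = f"node-{len(loads) + 1}"
        loads[target] = 0
    loads[target] += job
    return target

loads = {"node-1": 70, "node-2": 40}
first = dispatch(loads, 30, capacity=100)   # fits on the least-loaded node
second = dispatch(loads, 50, capacity=100)  # no node has room, so scale out
print(first, second, loads)
```

The first job is balanced onto the lightly loaded node, while the second one triggers the provisioning of a new node because even the least-loaded existing node would exceed its capacity.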

2.7 Availability Strategies

The availability strategy is used to keep the service or system available according to a set of policies. The strategies proposed in [36] balance infrastructure resource usage by triggering automatic migration to the next available host under high load. Migration should be triggered on hosts that are marked as hotspots because they are under high computational demand. A hotspot is a host whose CPU utilization, memory usage, or occupied bandwidth is above the established threshold. The next available host is a host whose resource usage is below the threshold and lower than that of the other available hosts. This way, resource shortage is avoided and the workload is distributed across all available hosts [36].
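The hotspot rule and the choice of the next available host can be expressed as a small Python sketch. The single threshold of 0.80, the host names, and the use of the highest per-resource usage as the comparison criterion are assumptions made for illustration; [36] may define them differently.

```python
THRESHOLD = 0.80  # assumed fraction of capacity, applied to every resource

def is_hotspot(usage):
    """A host is a hotspot if any monitored resource exceeds the threshold."""
    return any(value > THRESHOLD for value in usage.values())

def next_available_host(hosts):
    """Pick the non-hotspot host with the lowest peak usage, or None."""
    candidates = {n: u for n, u in hosts.items() if not is_hotspot(u)}
    if not candidates:
        return None
    # Assumption: hosts are ranked by their most-used resource
    return min(candidates, key=lambda n: max(candidates[n].values()))

hosts = {
    "host-a": {"cpu": 0.93, "memory": 0.60, "bandwidth": 0.40},
    "host-b": {"cpu": 0.55, "memory": 0.70, "bandwidth": 0.30},
    "host-c": {"cpu": 0.35, "memory": 0.45, "bandwidth": 0.20},
}
hotspots = [name for name, usage in hosts.items() if is_hotspot(usage)]
target = next_available_host(hosts)
print(hotspots, target)
```

Here `host-a` is flagged as a hotspot because of its CPU utilization, and its VMs would be migrated to `host-c`, the available host with the lowest usage.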

The strategy proposed in [37] ensures the SLA established with the client. SLA constraints can be different for each service, and if the host cannot fulfill the SLA established by the user, migration is triggered to the next available host that fulfills it.

Both strategies are well suited to be used simultaneously to provide better coverage for overload protection. Together, they are an adequate proposal for obtaining a highly available system that supports critical services against failures.

2.7.1 Solutions

Wubin Li and Ali Kanso [38] compared containers and virtual machines with respect to achieving high availability. Of all the compared solutions, the following were found to be the most relevant for this dissertation: Docker Swarm, Kubernetes, and VMware. VMware is a commercial solution that allows virtualization using a bare-metal (type 1) hypervisor without requiring an underlying OS. VMware's strategy for handling failover is to use clustering, requiring all virtual disks to be on shared storage. Every host must have a Highly-Available (HA) agent communicating heartbeats to the cluster to acknowledge its presence and state. VMware HA can protect against three types of failures: host failure, guest OS failure, and application failure. Upon host failure, VMware restarts the VM on another host. If the guest OS stops working or the application fails, VMware resets the VM on the same host. VMware also features a continuous availability functionality, which is a failover mechanism allowing zero downtime. Zero downtime is only possible if there are two VMs, on different hosts, where the first VM synchronizes every hardware instruction with the second VM


supported by Intel and AMD processors. So far, VMware FT only supports a single vCPU per FT-enabled VM and requires a fast network of at least 1 Gbit/s. However, having only one vCPU is not enough for the majority of critical services. At VMworld 2012 and 2013, Jim Chow et al.¹⁹ presented Symmetric Multi-Processor Fault Tolerance (SMP-FT). SMP-FT allows FT to support multiprocessing, with up to 8 vCPUs per FT-enabled VM, but it requires an even faster network link of at least 10 Gbit/s. SMP-FT uses a new technology called FastCheckpointing, which made it possible to increase the number of vCPUs by executing the CPU instructions only at the primary VM and sending only the instruction results to the secondary [39].

In the same article [38], Kubernetes, a containerization solution that provides service high availability, is analyzed. Kubernetes groups tightly coupled containers into pods and loosely coupled ones through key/value labels. The labels consist of metadata that describe the semantic function of the service. The master node is responsible for maintaining the cluster status and the communication between the resources. Kubernetes also has the concept of a ReplicationController, which keeps a predefined number of replicas of a given pod always running. In the case of a pod or service failure, traffic is redirected to a replica, and the failed pod is re-instantiated on a healthy node. In the case of a master node failure, traffic is redirected to a replica master, covering the host failure situation. Kubernetes is equipped with failure detection and failover mechanisms that allow recovery from host, guest OS, and application failures. Although Kubernetes provides high availability, it lacks mechanisms for state preservation for service continuity; it is nevertheless adequate for stateless applications.
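The replica-keeping behaviour described above can be illustrated with a simplified reconciliation step in Python. This is a conceptual sketch, not the Kubernetes API: the `reconcile` function and the pod dictionaries are hypothetical, and the sketch assumes that at least one healthy node exists.

```python
def reconcile(desired, pods, healthy_nodes):
    """Keep `desired` running replicas by re-creating failed pods on healthy nodes."""
    running = [p for p in pods if p["status"] == "running"]
    missing = desired - len(running)
    for i in range(missing):
        # Spread replacement pods over the healthy nodes in round-robin order
        node = healthy_nodes[i % len(healthy_nodes)]
        running.append({"node": node, "status": "running"})
    return running

pods = [
    {"node": "n1", "status": "running"},
    {"node": "n2", "status": "failed"},   # pod lost together with node n2
    {"node": "n3", "status": "running"},
]
replicas = reconcile(desired=3, pods=pods, healthy_nodes=["n1", "n3"])
print(replicas)
```

Note that the replacement pod starts from a fresh state, which matches the observation that this approach suits stateless applications but does not by itself preserve service state.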

Richter et al. [40] presented Docker Swarm as a solution for a microservice architecture. Docker Swarm has a Swarm manager, which controls the cluster of containers. The manager can have multiple replicas of itself to fail over to if the primary manager is unavailable, covering the host failure scenario. Communication relies on RabbitMQ, a message broker that can distribute and replicate messages across all RabbitMQ nodes to prevent message loss.

| Features | VMware [38][39] | Kubernetes [38] | Docker Swarm [38][40] | Heat [41] |
|---|---|---|---|---|
| Live migration | Yes | No | No | Yes |
| Checkpoint/Restore | Yes | No | No | Yes |
| Failure detection | Yes | Yes | Yes | Yes |
| Failover management | Yes | Yes | Yes | Yes |
| Guest OS | Any | Linux | Linux | Any |
| Recover from host failure | Yes | Yes | Yes | Yes |
| Recover from Guest OS failure | Yes | Yes | Yes | Yes |
| Recover from application failure | Yes | Yes | Yes | Yes |
| Service continuity | Yes | No | No | No |

Table 2.5: Orchestration system comparison

Heidari et al. [41] presented Heat, an orchestration service designed to work with OpenStack clouds. Heat provisions VMs in a stack according to the description provided in the template file and can monitor the VM/App state. Heat can restart the instance on VM failure or when the application is not responding, but it cannot perform failover. During the stack deployment, Heat sends the configuration to the metadata server, which communicates directly with the VM to configure the monitoring tools inside the instance. Periodically, the VM reports information about the VM and service state to the metadata server. Then, Heat pulls the monitoring information from the metadata server. However, if Heat does not receive information from the VM in time, it assumes that the VM has failed and tries to recover by restarting the stack. The monitoring interval can be defined in a Heat template. The application layer works in a similar way. If Heat does not receive information from the application, it restarts the application. If it still does not receive the application status, Heat escalates to a VM failure and restarts the whole stack. Although Heat can protect against application and VM failures, it does not provide service continuity in case of failure.
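The escalation logic described for Heat can be summarized as a small decision function. The Python sketch below is only a conceptual model of that description; it does not use the Heat API, and the function name, its parameters, and the action labels are hypothetical.

```python
def recovery_action(vm_reported, app_reported, app_restarted):
    """Decide what to do after one monitoring interval.

    vm_reported / app_reported: whether a status report arrived in time.
    app_restarted: whether the application was already restarted once.
    """
    if not vm_reported:
        return "restart_stack"  # the VM is assumed to have failed
    if not app_reported:
        # The first miss restarts the application; a repeated miss escalates
        return "restart_stack" if app_restarted else "restart_application"
    return "none"

print(recovery_action(True, True, False))
print(recovery_action(True, False, False))
print(recovery_action(True, False, True))
print(recovery_action(False, False, False))
```

In every failing branch the recovery is a restart, which illustrates why this approach protects against application and VM failures without providing service continuity.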

Table 2.5 compares the availability features of the most relevant orchestrators. In a containerized environment (Docker Swarm and Kubernetes), live migration and restore methods are not possible. Failure detection and failover management are possible in all four orchestrators. Since VMware and Heat provision VMs, they can support the deployment of all OS types. In contrast, Docker Swarm and Kubernetes only allow Linux-based OSs. Every presented orchestrator supports the recovery features, but only VMware can assure service continuity. In summary, VMware has better overall availability support, but it is a commercial solution and is not open-source. The second-best orchestrator is Heat, an open-source solution that is free to use.


CHAPTER 3

SLA constraints in OpenStack Stacks

There are numerous ways of keeping a service running correctly. This chapter focuses on an architecture design that complies with SLA requirements and on the approaches taken when failures are found. An SLA is a commitment between the provider and the client to ensure the agreed quality of service. As described previously, some of the SLA attributes are Availability, Performance, Latency, Support, Network, and Application. Support is the time during which the support team is available to assist the customer when help is required. The support attribute is out of the scope of this dissertation, since the cloud provider itself provides support. This dissertation's objective is to cover the IaaS cloud model, where the critical application should be guaranteed to run according to an established set of SLAs. Also, in the IaaS cloud model, the service provider is responsible for ensuring the SLA of the infrastructure, middleware, and application layers. To cover all those levels according to the SLA, it is essential to monitor various aspects of the system, from the hardware to the software layer. To better formulate the solution towards the objective, it was necessary to settle on the cloud environment technology. Creating a generic architecture solution without a specific cloud environment could lead to major changes and further investigation during the implementation phase. For this reason, OpenStack, an open-source cloud environment, was chosen, also because SKA SDP plans to use it.

3.1 SLA Characteristics

This section describes the SLA characteristics, explaining how to quantify them, which metrics are associated with them, how to monitor them, and how to reestablish them after an unexpected failure.


3.1.1 Availability

Availability can be described as the probability of the system working correctly when it is required to. A more reliable system improves availability: the higher the reliability, the higher the chance that the system is available. It is worth noting that reliability only matters while the system has not yet failed. Once the system fails, the important aspect is maintainability, which is the amount of time required to recover from a failure. If either aspect is neglected, availability decreases: for example, choosing cheaper hardware reduces reliability, and a very complex system architecture reduces the probability of recovering quickly from a failure. Conversely, enhancing only one of the aspects without changing the other still increases the overall availability. In conclusion, to achieve better availability, we must take both reliability and maintainability into account by evaluating and choosing the tools according to our needs and budget.
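The relationship between reliability, maintainability, and availability can be illustrated with a small sketch. It assumes the commonly used steady-state formula, availability = MTBF / (MTBF + MTTR), which is not stated in the text above; the function name and the numeric values are hypothetical and chosen only for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability (assumed formula): MTBF / (MTBF + MTTR).

    mtbf_hours: mean time between failures, a proxy for reliability.
    mttr_hours: mean time to repair, a proxy for maintainability.
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical values, for illustration only.
baseline = availability(mtbf_hours=1000.0, mttr_hours=10.0)

# Improve reliability only: failures happen half as often.
more_reliable = availability(mtbf_hours=2000.0, mttr_hours=10.0)

# Improve maintainability only: recovery takes half as long.
faster_recovery = availability(mtbf_hours=1000.0, mttr_hours=5.0)

for label, value in [
    ("baseline", baseline),
    ("more reliable", more_reliable),
    ("faster recovery", faster_recovery),
]:
    print(f"{label}: {value:.4%}")
```

Under this assumed model, improving either input alone raises the result, which matches the observation that enhancing only one aspect still increases overall availability.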

Service availability is the ability of a user to access information at all times, provided they have clearance for it. A system malfunction or a compromise of data security affects availability negatively, because the information is then either not secure or not easily available. Thus, availability must be part of the SLA attributes, which consist internally of rules and thresholds used to keep the service running as required.

Service reliability can also be increased by adding redundant equipment that prevents service failure by switching or redirecting the traffic to a redundant and healthy system. Such redundant devices are usually used at the scale of data-centers, which can be classified into tiers according to the ANSI/TIA-942 Data Center Standards¹. A Tier I data-center has the lowest requirements, with an availability of 99.671% and no redundant equipment. Tier II is a more complete system, with an availability of 99.741%, one spare piece of equipment (N+1), and power and cooling redundancy. Tier III has an availability of 99.982%; all IT devices are fault-tolerant, duplicated (N+1), and dual-powered from different energy sources. The network links are also duplicated, with two different service providers. Tier IV is the highest tier, providing an availability of 99.995%, with 2N+1 equipment redundancy and even more power, network, and cooling redundancy. This tier is usually used for mission-critical systems and is designed for large organizations, with a maximum downtime of 26 minutes per year.
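The tier percentages above translate directly into a yearly downtime budget. The following minimal sketch performs that conversion, assuming a 365-day year; the function and constant names are hypothetical, and only the percentages come from the text.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year

# Availability percentages as quoted in the text for each tier.
TIER_AVAILABILITY_PERCENT = {
    "Tier I": 99.671,
    "Tier II": 99.741,
    "Tier III": 99.982,
    "Tier IV": 99.995,
}


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the maximum yearly downtime implied by an availability figure."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be a percentage between 0 and 100")
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


for tier, percent in TIER_AVAILABILITY_PERCENT.items():
    minutes = downtime_minutes_per_year(percent)
    print(f"{tier}: {minutes:.1f} min/year ({minutes / 60:.1f} h/year)")
```

For Tier IV this gives roughly 26 minutes per year, consistent with the figure quoted above, while Tier I allows more than a full day of downtime per year.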

The percentage of availability can be determined by monitoring the service at a particular time interval. The monitoring interval for metric collection can be set
