SLA constraints in OpenStack Stacks
3.5.2 Service Recovery
The workflow in Figure 3.12 shows the steps required to recover the service. The starting point is Monitor SLA metrics, where the monitoring system checks the metrics received from the agent inside the virtual instance for possible SLA violations. If a metric value is not in agreement with the alert threshold, the system moves to the next state to obtain the recovery methods associated with the alert. Otherwise, the monitoring system remains in the first state, checking for SLA violations.
According to the recovery methods established in the SLA enhanced template, the system executes the recovery method and returns to the first state. If the system manages to recover the service, meaning that the metric value comes into agreement with the alert threshold, the service is marked as recovered. Otherwise, the system moves to the next available recovery method associated with the alert, until the service is recovered. If all attempts to recover the service fail, the service enters the Failed to recover state. If there are no recovery methods available at all, the alert likewise enters the Failed to recover state.
Figure 3.12: Service recovery workflow
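The loop described in this workflow can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name recover_service, the callables passed to it and the "Recovered" label are hypothetical, and only the Failed to recover state name comes from the text.

```python
from typing import Callable, Iterable

# "Failed to recover" is the state named in the text; "Recovered" is an assumed label.
RECOVERED = "Recovered"
FAILED_TO_RECOVER = "Failed to recover"


def recover_service(
    violates_sla: Callable[[], bool],
    recovery_methods: Iterable[Callable[[], None]],
) -> str:
    """Try each recovery method in order until the metric agrees with the threshold."""
    for method in recovery_methods:
        method()                 # execute the recovery method
        if not violates_sla():   # back in the first state: re-check the metric
            return RECOVERED
    # Either no methods were available, or every method was tried without success.
    return FAILED_TO_RECOVER
```

Passing an empty list of methods models the case where an alert has no recovery methods at all, which ends in the same state as exhausting every method.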
Software Configuration
Although the HOT file supports shell script deployment (property user_data), which can be sufficient for some software configuration, it is limited, and setting up complex environments with it can be challenging. Many developers know Ansible well and use it regularly to set up environments or to deploy software, owing to the large number of available configuration modules. As one of the dissertation objectives is to create a more straightforward method to deploy software, it was decided to add an Ansible option to the HOT file. For the orchestrator to execute the Ansible playbook, the property ansible must be set with the path to the playbook, and the path must be relative to the location of the SLA enhanced template.
Listing 5 shows the Ansible script definition within the SLA enhanced template.
...
resources:
  resource_name:
    type: OS::Nova::Server
    properties:
      ...
    content:
      ansible:
        file: String
        extra-vars: String
...
Listing 5: Definition of Ansible script
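As a rough illustration of how the file and extra-vars values could be turned into an ansible-playbook invocation, the sketch below resolves the playbook path relative to the template location. The function name playbook_command and its parameters are hypothetical, and inventory handling is deliberately left out; only the ansible-playbook tool and its --extra-vars option are taken as given.

```python
from pathlib import Path
from typing import List, Optional


def playbook_command(
    template_path: str, file: str, extra_vars: Optional[str] = None
) -> List[str]:
    """Build an ansible-playbook command line for the `ansible` property."""
    # `file` is relative to the directory holding the SLA enhanced template.
    playbook = Path(template_path).parent / file
    command = ["ansible-playbook", str(playbook)]
    if extra_vars:
        # Forward the value of the `extra-vars` property unchanged.
        command += ["--extra-vars", extra_vars]
    return command
```

The returned list could be handed to subprocess.run, which avoids shell quoting issues in the extra variables.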
CHAPTER 4
Implementation
As described in the previous chapter, the proposed architecture is composed of four main components: service manager, cloud environment, monitoring system, and recovery engine. This chapter focuses on the implementation of those components and explains the reasons behind the technology choices.
4.1 SLA enhanced template
This section serves as an index page for the implemented template, directing each template property to the respective section where it is explained in detail.
Listing 6 depicts the implemented SLA enhanced template in YAML format. The choice of YAML for the SLA enhanced template was based on the orchestrator template. Since the Heat orchestrator uses the YAML template format, it is easier to extract the orchestrator-related information when the SLA enhanced template is split, with no additional changes or transformations required. Besides, YAML is a human-friendly data serialization standard, which makes it easier for clients to use. The final result, depicted in Listing 6, is a modified Heat template with the added sla and content properties. The SLA enhanced template is analyzed and processed by the service manager (explained later), which splits it into four individual sections.
The first section is the OpenStack Heat orchestrator template, which describes the resources to provision, that is, the underlying infrastructure. Properties heat_template_version, heat_stack_name, description, resources, resource_name, type and properties are orchestrator related. This template part is further explained in Section 4.3.
The second section is the Ansible template used to configure the service on top of the underlying infrastructure. Properties content, ansible, file and extra-vars are software configuration related. This template part is used by the service manager to configure the service, as further explained in Section 4.4.
The third section contains the SLA rules used by the TICK Stack monitoring system (explained later) to check for SLA violations. The external and internal properties are responsible for service monitoring from an external and an internal point of view, respectively. This template part and the monitoring system are further explained in Section 4.5.
The last section contains the recovery methods, used by the recovery engine (explained later) to instruct the orchestrator on how to recover the service. The recovery property is recovery engine related and is further explained in Section 4.6.
heat_template_version: Date
To achieve the solution described in Chapter 3, considering the work complexity and data dimension, the best strategy for prototyping is to use a cloud environment, dividing resources into modules, such as computing, networking, storage, and others. This way, it is possible to use available resources more efficiently.
OpenStack is considered to be the best-fit platform for managing resources because it allows Virtual Machine (VM) deployment, use of container technologies and even running VMs on top of bare metal for better performance. Another reason for choosing
use OpenStack in their implementation, it would make sense to follow their decision and contribute if possible.
OpenStack is an open-source software platform for cloud computing, used to create Infrastructure-as-a-Service (IaaS) with the ability to interconnect separate modules into a single cloud service. It therefore provides a reliable, abstract platform on top of physical infrastructure. This makes it possible to dedicate an entire server to a single resource and, when high availability is used, to create an exact server replica and load balance the traffic between the two. Traffic can then be spread evenly across the servers and, in case of failure, redirected from the failed server to a healthy one.
Figure 4.1 represents the currently deployed OpenStack structure of the platform.
Figure 4.1: OpenStack module distribution
The OpenStack setup required the installation and configuration of several modules to allow the work to progress towards the dissertation objective. The main OpenStack module is the computing service, called Nova, responsible for VM deployment. In order to deploy a VM, the Nova module has to use virtualization. According to the State of the Art in Chapter 2, there are three types of virtualization: full virtualization, paravirtualization and OS-level virtualization. In this dissertation work, performance is critical, which led to the choice of full virtualization, as it is the only type that allows direct hardware access, resulting in quicker operations.
With a virtualization type in mind, what is left is to choose an appropriate virtualization technology. According to the State of the Art, vSphere and Kernel-based Virtual Machine (KVM) showed better performance results than the other two of the four tested hypervisors. Since the chosen cloud environment was OpenStack, vSphere had to be removed from the hypervisor candidate list, because it is not compatible. This leaves KVM, which runs on top of the Linux OS and converts it into a bare-metal (type-1) hypervisor, ideal for achieving better performance.
For a VM to access the internet or to communicate with other VMs, the Neutron module was configured. As Neutron does not require a lot of computation or storage resources, it can be placed on the same node where the controller and monitoring system reside.
Some VMs might require extra storage space, which requires the Cinder module. The file system used for the storage node was GlusterFS. The choice of GlusterFS as the underlying file system for Cinder volume storage was based on the fact that it can handle 64 Terabytes (TB) per node, a significant volume size that can be easily scaled, because GlusterFS stores information in blocks that can be easily moved, changed and even mirrored between storage nodes.
The implementation of the cloud environment is part of the established requirements.
Another requirement is the implementation of an orchestrator to automatically provision the resources required to run the service. Once OpenStack had been chosen for the cloud environment, choosing the orchestrator technology was easier. OpenStack has an official orchestrator, Heat, which is fully compatible with OpenStack. It would not make sense to use any other orchestrator, as none would be as compatible as Heat.
OpenStack also provides a web-based dashboard to interact with the cloud, called Horizon, allowing clients to create their own environments with little knowledge.
4.3 OpenStack Heat template
This subsection describes the Heat Orchestration Template (HOT) file structure and some of the available parameters. This information was retrieved from the OpenStack HOT guide1 and OpenStack Resource Types2.
A HOT file uses the YAML format and is processed by the OpenStack Heat module to provision resources. A single file can define more than one resource, which allows the whole project architecture to be built from a single configuration file. It is also possible to configure networks and subnets, assign SSH key-pairs and floating IPs to an instance, provision and manage Cinder volumes, deploy a monitoring system, and more.
1 https://docs.openstack.org/heat/rocky/template_guide/hot_guide.html
2 https://docs.openstack.org/heat/rocky/template_guide/openstack.html
Throughout the years of Heat development, new features were added and some were discontinued. For Heat to know which specifications are valid and which are not, the property heat_template_version must be set with the HOT release date, e.g. 2015-10-15. Another Heat template related variable to set is the heat_stack_name property, used to distinguish the stack from other Heat stacks. This property is not part of the original HOT file; it is one of the added modifications. The description variable is optional, but its use is encouraged. It describes what users can do with the template, which helps when the template is shared with someone not familiar with it.
When provisioning resources with Heat, all resources must be named (resource_name) and added under the resources property, as shown in Listing 7. Each resource has a type property, used to specify the resource type to provision. The most commonly used resource types are displayed in Table 4.1. After choosing the right resource type, it is possible to define the properties associated with the resource. In the case of OS::Nova::Server, the most common properties are:
• flavor - name of the pre-defined OpenStack flavor, used to define virtual instance characteristics, such as CPU, RAM and more.
• image - name of OS image for virtual instance to boot
• key_name - name of ssh keypair injected into the virtual instance
• networks - list of networks connected to the virtual instance
• user_data - user data script to be executed by cloud-init during boot
heat_template_version: Date
In OpenStack, a flavor defines the compute, memory and storage capacity of the virtual instance; in short, the hardware configuration available to the instance. On top of that hardware, an underlying OS is required to run tasks. The Glance module is responsible for managing the OS images, which can be available in a variety of formats, the most used being qcow2.
OpenStack resource type Definition
Table 4.1: Some OpenStack resource types
Some of the properties of OS::Nova::Server are exemplified in Listing 7. This template will provision a virtual instance with default tools if the OS image was not changed. In some cases, users might want to set up their own environment inside the virtual machine. To achieve that, a software configuration script can be used with Heat.
Shell script content specified in the user_data property is passed to the instance and executed by cloud-init at boot time. Cloud-init is a standard method used in cloud environments to initialize the virtual instance according to a set of defined parameters. It can also execute custom commands specified by the client.
To remotely access the virtual instance, either the OS image must come with a predefined remote access password, or an SSH key must be associated with the instance. An SSH key-pair is created by the OS::Nova::KeyPair resource and is given a name property, which is used to refer to it further in the file. To import an already existing key, the public_key field must be set with the content of the public SSH key. If the public_key field is not specified and the save_private_key field is set to true, a new key-pair is created and its private key is saved. An example of how to define the KeyPair resource is shown in Listing 8.
Listing 8: Create a key pair
To increase the storage space, VM supports multiple volume attachments, managed
field is not set, the created volume is empty. Setting the image field with a valid OS image will create a bootable volume. When creating a new volume, the size property must be specified, unless the new volume originates from another volume (source_volid property), a snapshot volume (snapshot_id property) or a backup volume (backup_id property), in which case the size of the source volume is used.
For the volume attachment, the property volume_id must be set with a valid ID of the volume, and the instance_uuid property with a valid ID of the VM to which the volume will be attached. Listing 9 exemplifies volume provisioning and attachment.
resources:
Listing 9: Create and attach a volume to an instance
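The rule for when the size property may be omitted can be expressed as a small check. The sketch below is only an illustration: the helper names size_is_required and validate_volume are hypothetical, while the property names are the ones given in the text.

```python
from typing import Any, Mapping

# Properties named in the text from which the size of a new volume is inherited.
SIZE_SOURCES = ("source_volid", "snapshot_id", "backup_id")


def size_is_required(properties: Mapping[str, Any]) -> bool:
    """Return True when `size` must be given explicitly for a new volume."""
    return not any(properties.get(key) for key in SIZE_SOURCES)


def validate_volume(properties: Mapping[str, Any]) -> None:
    """Raise ValueError if the volume properties omit a required `size`."""
    if size_is_required(properties) and "size" not in properties:
        raise ValueError(
            "size is required unless source_volid, snapshot_id or backup_id is set"
        )
```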
In most cloud environments, the number of available public IPs is very limited, which requires the use of internal networks for VMs. If a VM must be exposed publicly, a floating IP can be used. The OS::Neutron::FloatingIP resource is responsible for allocating floating IPs from any network. The property floating_network must be set with a valid network name to allocate the IP from. To associate the IP with a VM, the OS::Neutron::FloatingIPAssociation resource is used, where property floatingip_id must be set with the ID of the floating IP and property port_id with the ID of the network port associated with the VM. When the VM is deleted, the floating IP is released to the IP pool or reassigned to another VM upon user request. Listing 10 shows how to allocate a floating IP and associate it with a VM.
resources:
Listing 10: Allocate and associate a floating IP to an instance
4.4 Service manager
The service manager is the component where all the action starts. Figure 4.2 depicts the class diagram of the service manager implementation and the directly associated classes. The main variables of the class are openstack_auth, the OpenStack authentication credentials used to interact with the OpenStack API; template_path, the path to the SLA enhanced template; and action_type, which specifies whether the service should be deployed or destroyed. The value of action_type can be either deploy, to trigger service deployment, or remove, to remove the service. The function service_deployment() is the main function of the service manager, while the other functions are secondary and are used by the main function.
After the client submits the template, it is analyzed using the template_analysis() function and then split by the template_split() function into four parts: orchestrator template, Ansible template, SLA template and recovery template. Function template_analysis() uses the yaml.load() function to verify the template structure. If yaml.load() executes without errors, the template structure analysis has passed. Function template_split() splits the template by extracting the properties that are not Heat related, meaning that every variable not used by the Heat orchestrator is extracted from the template into a separate one. The SLA enhanced template is available in Section 4.1, Listing 6.
Property content is extracted from the template, creating an Ansible template, later used by the service manager to configure the service using the ansible-playbook tool. Within the sla property, external and internal are extracted to create an SLA template. This is used by the monitoring system to create monitoring metrics, alert definitions and dashboard statistics. The remaining property within sla is recovery, which is also extracted, creating a recovery template used by the recovery engine. After the extraction of the previous properties from the template, the only properties left are Heat related, resulting in a HOT file, used by the OpenStack Heat orchestrator to deploy the underlying infrastructure.
Figure 4.2: Service manager implementation architecture
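The split can be illustrated on a template that has already been parsed into a Python dictionary (for example by a YAML loader). This is a minimal sketch, not the actual service manager code: it assumes that content and sla are keys of each entry under resources, and the shape of the three extracted templates (one entry per resource name) is likewise an assumption.

```python
import copy
from typing import Any, Dict, Tuple

Template = Dict[str, Any]


def template_split(template: Template) -> Tuple[Template, Template, Template, Template]:
    """Split a parsed SLA enhanced template into HOT, Ansible, SLA and recovery parts.

    Assumption: `content` and `sla` are keys of each resource entry.
    """
    hot = copy.deepcopy(template)  # leave the caller's template untouched
    ansible: Template = {}
    sla: Template = {}
    recovery: Template = {}

    for name, resource in hot.get("resources", {}).items():
        # Software configuration: everything under `content`.
        content = resource.pop("content", None)
        if content is not None:
            ansible[name] = content
        # Monitoring rules and recovery methods both live under `sla`.
        sla_block = resource.pop("sla", None) or {}
        rules = {k: sla_block[k] for k in ("external", "internal") if k in sla_block}
        if rules:
            sla[name] = rules
        if "recovery" in sla_block:
            recovery[name] = sla_block["recovery"]

    # What remains in `hot` is Heat related only.
    return hot, ansible, sla, recovery
```

Working on a deep copy keeps the submitted template intact, so it can still be logged or re-validated after the split.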
After template analysis, if variable action_type is set to deploy, the service manager uses the OpenStack API to verify whether the stack already exists (stack_exist()); if it does not, OpenStack Heat receives the HOT file and deploys the stack using deploy_stack().
A stack is a collection of resources provisioned by Heat. If action_type is set to remove, OpenStack Heat executes remove_stack(String heat_stack_name), with the heat_stack_name property taken from the template, to remove the stack.
When the stack provisioning completes, the service manager installs and configures the service using the ansible-playbook tool, according to the Ansible template. After the service manager finishes configuring the service with the perform_software_installation() function, the perform_agent_configuration() function deploys the Telegraf monitoring agent in the VM environment (explained in Section 4.5), which is used to report SLA metrics to the monitoring system.
Following the monitoring agent configuration, we define in the monitoring system the thresholds for each alert from the SLA template, using the Kapacitor store_sla_rules() function. In addition, the Chronograf create_sla_dashboard() function creates a visualization dashboard to view the current SLA monitoring statistics. If variable action_type is set to remove, the remove_sla_dashboard() function removes the dashboards associated with the service.
Finally, the recovery template is stored in the OpenStack database using the store_recovery_methods function, and the IP address used to access the service is displayed.
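The order of the steps described in this section can be summarized as a short Python sketch. The function names are those mentioned in the text, but their signatures, the RecordingBackend stand-in, and the decision to stop when the stack already exists are assumptions made purely for illustration.

```python
from typing import Any, List


class RecordingBackend:
    """Hypothetical stand-in that records which step was requested."""

    def __init__(self, stack_already_exists: bool = False) -> None:
        self.stack_already_exists = stack_already_exists
        self.calls: List[str] = []

    def stack_exist(self, stack_name: str) -> bool:
        self.calls.append("stack_exist")
        return self.stack_already_exists

    def __getattr__(self, step: str) -> Any:
        # Any other step simply records its own name when called.
        def record(*args: Any) -> None:
            self.calls.append(step)
        return record


def service_deployment(backend: Any, action_type: str, stack_name: str) -> None:
    """Run the deployment or removal steps in the order described in the text."""
    if action_type == "deploy":
        if backend.stack_exist(stack_name):
            return  # assumption: an existing stack is left untouched
        backend.deploy_stack(stack_name)
        backend.perform_software_installation()
        backend.perform_agent_configuration()
        backend.store_sla_rules()
        backend.create_sla_dashboard()
        backend.store_recovery_methods()
    elif action_type == "remove":
        backend.remove_stack(stack_name)
        backend.remove_sla_dashboard()
    else:
        raise ValueError(f"unknown action_type: {action_type!r}")
```

Injecting the backend keeps the ordering logic testable without a running OpenStack or TICK Stack installation.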
4.5 Monitoring subsystem implementation
The technology chosen for the monitoring system implementation is the TICK Stack, which offers low complexity, supports a high number of plugins and uses a time-series database. The TICK Stack consists of Telegraf, InfluxDB, Chronograf and Kapacitor (TICK). Telegraf is an agent deployed in each VM to collect and report measured metrics. The Telegraf agent uses plugins to monitor different metrics and allows the creation of generic plugins, which makes it possible to implement custom monitoring scripts. A complete list of official plugins can be found in the documentation on the official InfluxDB website3.
3 https://docs.influxdata.com/telegraf/v1.11/plugins/plugin-list/
The creation of generic plugins helps to obtain certain metrics, such as network performance, that are not supported by official Telegraf plugins. As network performance is part of our solution, we had to implement it ourselves, creating a Python script that uses the iPerf2 tool to obtain the network performance and returns the required metrics, such as bandwidth, network jitter and the percentage of lost network packets. iPerf2 was chosen instead of iPerf3 (the newer version) because it can receive multiple network measurement requests simultaneously, a capability that was discontinued in the newer version for lack of a good reason to keep supporting it4. The developed script accepts the variables host, protocol, interval, bandwidth and name as program arguments. The script output varies according to the protocol. If the protocol is tcp, the only received output is the tcp bandwidth. And if the protocol is udp, the output metrics are udp