Parallel implementation proposal of clustering algorithms in hardware

UNIVERSIDADE FEDERAL DO RIO GRANDE DO NORTE

CENTRO DE TECNOLOGIA

PROGRAMA DE PÓS-GRADUAÇÃO EM ENGENHARIA ELÉTRICA E DE COMPUTAÇÃO

Parallel Implementation Proposal of Clustering Algorithms in Hardware

Leonardo Alves Dias

Advisor: Prof. Dr. Marcelo Augusto Costa Fernandes

Doctoral Thesis presented to the Graduate Program in Electrical and Computer Engineering of UFRN (concentration area: Computer Engineering) as part of the requirements for obtaining the degree of Doctor of Science.

PPgEEC Order Number: D275

Natal, RN, October, 2020


Dias, Leonardo Alves.

Parallel implementation proposal of clustering algorithms in hardware / Leonardo Alves Dias. - 2020.

91 f.: il.

Tese (doutorado) - Universidade Federal do Rio Grande do Norte, Centro de Tecnologia, Programa de Pós-Graduação em Engenharia Elétrica e de Computação.

Orientador: Prof. Dr. Marcelo Augusto Costa Fernandes.

1. Massive data sets - Tese. 2. Data clustering - Tese. 3. Parallel systems - Tese. 4. Hardware - Tese. I. Fernandes, Marcelo Augusto Costa. II. Título.

RN/UF/BCZM CDU 004.451

Catalogação de Publicação na Fonte. UFRN - Biblioteca Central Zila Mamede


To God, for the opportunity and the willpower during the course of this work, and to my mother for always believing in me.


Acknowledgements

The word "thank you" is used to express gratitude. However, in some cases, it is not easy to express the true feeling of gratitude, or to measure it, with words alone. Even so, I want to thank here, through words, those who contributed to the completion of this work.

I want to thank God first, who in His goodness gave me this opportunity and the willpower to fulfil this purpose.

To my family, who supported me at all times, especially my parents and my sister, and who always believed in my potential to overcome any challenge.

To my advisor, Marcelo Augusto Costa Fernandes, for all his patience and dedication. Besides being an excellent advisor on the technical side, he taught me how to be a good researcher.

To my friends, whom I cannot all name here, but who supported me throughout this work.

To all my colleagues at LAMII and in the graduate program, who were always with me on this journey, for the hilarious moments of relaxation, and for always lending a hand and dedicating a little of their time to help me with my difficulties.

To the PPgEEC coordination and UFRN, for providing the proper environment for my studies.

To all the professors of PPgEEC, and everyone else who was part of my academic life, to whom I owe deep gratitude.

To CAPES, for the financial support during my sandwich doctorate in England.

Finally, to everyone who directly or indirectly contributed to the completion of this work, my sincere thanks.


Abstract

This work presents a study on data clustering algorithms implemented in dedicated hardware for applications in general, aiming to increase processing speed. Clustering algorithms have been widely adopted to find patterns between data in different areas. However, these algorithms usually imply high processing complexity and, in addition, the amount of data currently stored is massive. Therefore, the need for high-throughput data processing has become even more critical, especially for real-time applications. One solution that has been adopted to increase processing speed is the use of parallel techniques implemented on dedicated hardware, which has proved to be more efficient compared to sequential systems. Therefore, this work proposes the fully parallel implementation of data clustering algorithms in hardware to optimize the processing time of systems in several areas, enabling applications for systems with a massive amount of data. New implementation proposals for the clustering algorithms K-means and Self-Organizing Map are presented, together with an analysis of the results related to throughput and hardware resources for different parameters, showing processing rates of millions of data points and connections updated per second. The implementations presented here point to a new direction for the implementation of clustering algorithms and can be used in other algorithms.


Resumo

This work presents a study of data clustering algorithms implemented in dedicated hardware for applications in general, aiming to increase processing speed. Clustering algorithms have been widely adopted to find patterns in data in different areas. However, these algorithms usually imply high processing complexity and, in addition, the amount of data currently stored is massive. Therefore, the need for high-throughput data processing has become even more important, especially for real-time applications. One solution that has been adopted to increase processing speed is the use of parallel techniques implemented in dedicated hardware, which has proved to be more efficient compared to sequential systems. Hence, this work proposes the fully parallel implementation of data clustering algorithms in hardware to optimise the processing time of systems in several areas, enabling applications for systems with a massive amount of data. New implementation proposals for the clustering algorithms K-means and Self-Organising Map are presented, together with analyses of the results related to throughput and hardware resources for different parameters, showing an increase in processing speed to millions of data points and neurons updated per second. The implementations presented here point to a new direction associated with the implementation of clustering algorithms and can be used in other algorithms.

Keywords: Massive data sets, Data clustering, Parallel systems, Hardware.


Contents

Contents

List of Figures

List of Tables

List of Symbols and Abbreviations

1 Introduction
  1.1 Motivation
  1.2 Objective
  1.3 CAPES PrInt
  1.4 Published papers
  1.5 Thesis organization

2 K-means algorithm
  2.1 Related works
  2.2 The K-means algorithm
  2.3 Implementation Description
    2.3.1 Distance Metric Module
    2.3.2 Clustering Process Module
    2.3.3 Mean Centroid Module
    2.3.4 Centroid Register Module
  2.4 Experimental results
    2.4.1 Comparisons with state-of-the-art works
  2.5 Discussions

3 Self-Organising Map Algorithm
  3.1 Related Works
  3.2 Self-Organising Map
  3.3 Implementation Description
    3.3.1 Neuron Block (NB)
    3.3.2 Winning Neuron Tree (WNT)
    3.3.3 Learning Memory Module (LM)
  3.4 Experimental Results
    3.4.1 Comparisons with state-of-the-art works
  3.5 Discussions


  4.2 Self-Organising Map

5 Conclusion
  5.1 Future work


List of Figures

2.1 General architecture of the proposed parallel K-means algorithm implementation.
2.2 A g-th distance metric (DM) submodule for a j-th input data point, p^g_j[m], and K centroids, c_k[m].
2.3 A g-th clustering process (CP) submodule for K centroids.
2.4 Mean Centroid (MC) submodules for K centroids.
2.5 Circuits that constitute the k-th update centroid (UC_k) submodule.
2.6 Set of registers that constitute the CR module, used to store each D-dimension of all K centroids, c_k[m](n).
2.7 Synthetic two-dimensional Gaussian data set and its initial centroids, randomly generated.
2.8 Clustered two-dimensional Gaussian data set after the K-means runtime for K = 4.
2.9 Total resources used by varying g and K for D = 2 and m = 14.
2.10 Throughput obtained by varying g and K for D = 2 and m = 14.
2.11 Total resources used by varying D for g = 4, K = 8 and m = 16.
2.12 Throughput obtained by varying D for g = 4, K = 8 and m = 16.
2.13 Total resources used by varying m for g = 1, K = 4 and D = 2.
2.14 Throughput obtained by varying m for g = 1, K = 4 and D = 2.

3.1 General architecture of the proposed SOM algorithm implementation.
3.2 Two-dimensional SOM of the proposed architecture.
3.3 A k-th Neuron Block, NB_k^{P,Q}.
3.4 Submodules of the Winning Neuron Tree (WNT) module.
3.5 LM module and its submodules.
3.6 Synthetic two-dimensional data set and its initial neurons, randomly generated.
3.7 Resulting map of the neurons' final positions after the learning phase.
3.8 Total resources used by varying D and m for K = 25.
3.9 Throughput by varying D and m for K = 25.
3.10 Clock frequency by varying D for K = 25 and m = 16.

4.1 Silhouette value for the hardware and software approaches, respectively, for a Gaussian data set with K = 4.
4.2 Silhouette value for the K-means hardware implementation by varying m from 8 to 32 for K = 4 and D = 2.


4.4 SSE value for K-means hardware compared to software implementations, by varying m from 8 to 32 for K = 8.
4.5 Quantization error as a function of the training epochs for the same network implemented in both software and hardware, using Euclidean and Manhattan distance, respectively, in the implementations.
4.6 Quantization error as a function of the training epochs for the same network implemented in both software and hardware, using Manhattan distance in both implementations.
4.7 Quantization error as a function of the training epochs for the same network implemented in both software and hardware, using Manhattan distance in both implementations, and varying the number of bits in the hardware.
4.8 Topographic error as a function of the training epochs for the same network implemented in both software and hardware, using Euclidean and Manhattan distance, respectively, in the implementations.


List of Tables

2.1 K-means Synthesis on FPGA for g = 4, K = 4, D = 2 and m = 14.
2.2 Comparison of area overhead and processing speed for g = 8, K = 8 and D = 1.
2.3 Comparison of area overhead and processing speed for K = 8 and K = 16.
2.4 Comparison of area overhead for g = 1, K = 12 and D = 2.
2.5 Comparison of area overhead and processing speed for K = 4.

3.1 Synthesis results for varying numbers of neurons for D = 2 and m = 16.
3.2 Comparison of area occupation and throughput for K = 16 and K = 25.
3.3 Comparison of area occupation and throughput for K = 25.


List of Symbols and Abbreviations

AI Artificial Intelligence

ARM Acorn RISC Machine Processor

ASAP Application-specific Systems, Architectures and Processors

ASIC Application-Specific Integrated Circuit

BRAM Block Random-Access Memory

CAPES Coordenação de Aperfeiçoamento de Pessoal de Nível Superior

CP Clustering Process Module

CPU Central Processing Unit

CR Centroid Register Module

CUPS Connections Updated Per Second

DM Distance Metric Module

DPS Data Points per Second

DSP Digital Signal Processing

ED Euclidean Distance

FPGA Field-Programmable Gate Array

GPP General Purpose Processor

GPU Graphics Processing Unit

IDC International Data Corporation

IEEE Institute of Electrical and Electronics Engineers

IoT Internet of Things

LM Learning Memory Module

LUT Look-Up Table


MD Manhattan Distance

ML Machine Learning

NB Neuron Block Module

NP Neuron Processor

PC Personal Computer

PE Processing Element

QE Quantization Error

RC Reconfigurable Computing

RTAoSD Real-time Analytics of Streaming Data

RTL Register-Transfer Level

SED Squared Euclidean Distance

SOM Self-Organising Map

SSE Sum of Squared Error

TE Topographic Error

TPU Tensor Processing Unit

VHDL Very High Speed Integrated Circuits Hardware Description Language

WNT Winning Neuron Tree Module


Chapter 1

Introduction

Recently, modern technology has made it possible to acquire and store data in several fields, e.g., Wireless Sensor Networks (WSN). As a result, a massive amount of data is available nowadays. The exploitation of this data is an important task in building intelligent and optimised systems. A solution that has been endorsed to develop those systems is the use of Artificial Intelligence and Machine Learning algorithms, which have become an accepted tool for data analytics (Kibria et al. 2018).

Artificial Intelligence (AI) can be defined as the branch of computer science that deals with the automation of intelligent behaviour, i.e., pattern recognition, systems optimisation, robotics, etc. (Sternberg et al. 2008). Therefore, AI has been realised through algorithms, heuristics and methodologies that solve problems inspired by the human brain. Over the years, with the increasing complexity of problems and the large volume of data generated in different sectors, autonomous and sophisticated tools became necessary.

Due to that, Machine Learning (ML) tools were developed as the process of inducing hypotheses, based on data analysis. According to Mitchell (1997), ML can be defined as the ability to improve the performance of some task through experience. Hence, ML tools learn to induce a function or hypothesis capable of solving a problem from data that represent instances of the problem.

ML tools can be divided into two learning techniques: supervised and unsupervised. Both methods allow inferring behaviour based on the analysed data. Supervised learning enables an application to learn and improve its performance, usually through a training process with labelled data (Witten & Frank 2005), while unsupervised learning does not require a training process. Therefore, ML algorithms provide, by managing data, a solution for different tasks, e.g., classification and regression (MathWorks 2020).

Unsupervised learning is closely associated with "true artificial intelligence" as it allows drawing inferences from data sets without the need for a training process. Thereupon, it is possible to explore "raw and unknown" data (Denny & Spirling 2017). That can be accomplished, for example, by data mining techniques. In terms of massive data sets, data mining is a powerful strategy to extract information from data in an unsupervised way, providing insights about its information (Lin et al. 2019, Kurasova et al. 2014).

Data mining can be characterised as the process of "extracting knowledge" from data, and data mining techniques are often used to identify hidden and potentially useful information in previously unknown data from large data sets, providing an ideal solution for extraction and classification (Mittal et al. 2019, Han et al. 2011, Srinivas et al. 2010). Hence, it provides a solution for different tasks, e.g., clustering.

Clustering algorithms are among the most powerful unsupervised data mining techniques and are usually performed automatically or semi-automatically, based on data similarity (Witten et al. 2016, Neagu et al. 2016, Oussous et al. 2017). They are used to analyse non-labelled data in several applications, making it possible to discover patterns and generate class labels for a group of data that is not labelled in the beginning (Han et al. 2011). These labelled groups of data are known as clusters.

In the last years, the clustering algorithms K-means and Self-Organising Map (SOM) have often been used in several applications, due to their simplicity. Examples include image processing and segmentation, genome sequencing, real-time streaming analytics, data dimensionality reduction, etc. (Patel & Thakral 2016a, Chen, Chen, Ma & Chen 2019, Nathan & Lary 2019). Usually, the K-means outputs partitioned clusters of data, while the SOM represents the clusters through a two-dimensional map that has a topological relationship with the data.

To illustrate that, in Badawi & Bilal (2019) and Cheng & Wei (2019), the K-means was used to reduce the number of colours in images; Reza et al. (2019) applied it to label image pixels based on their colour; Chen et al. (2020), Capó et al. (2020) and Kuśmirek et al. (2019) proposed its use for the extraction and representation of RNA cells or genome sequences; while in Riaz et al. (2019) and Orkphol & Yang (2019) the K-means was applied to cluster word expressions based on their repetition intensity, to assist companies in understanding the interests of users over the internet. Besides that, applications that require clustering to assist in decision-making, anomaly detection and real-time streaming analytics were proposed in the literature, as can be seen in Yang et al. (2017), Asri et al. (2019) and Ariyaluran Habeeb et al. (2019).

Similarly, there are several applications for SOMs. In Li (2020), the SOM was used to get insights into the characteristics of ring-stiffened circular cylindrical shells subject to hydrostatic pressure. In Serfontein et al. (2019), the algorithm was applied to identify possible information security risks in organisations that use Social Network Analysis. In Parchure & Gedam (2019), it was used to cluster rain events in India, confirming the spatial variation of rainfall resulting from the topography of Mumbai. Likewise, applications for image retrieval and topological representation of data were also proposed in the literature, as can be seen in Sivakumar & Sathiamoorthy (2019), Laaksonen et al. (2002) and Chen, Wang & Tian (2019).

However, the mentioned applications generally rely on massive data sets. In recent years, the growth of the available data volume, as a consequence of the computerisation of different areas such as medicine, business, industry, telecommunication, mobile devices, Internet of Things (IoT), bioinformatics, etc., resulted in the need for powerful and versatile data mining techniques to uncover valuable information (Han et al. 2011). This massive volume of stored data is known as Big Data (Chen, Oliverio, Kim & Shen 2019).

Big Data is defined by several concepts that involve volume, speed, variety and value of data, called the four V's. In summary, it is a massive data set, with a fast rate of insertion of data of different types, whose value depends on its relevance to the application (Yaqoob et al. 2016, Chung & Wang 2017).

Analysing that huge volume of data and extracting information has become a necessary decision-making process for various organisations (Koseleva & Ropaite 2017). Nonetheless, extracting information from data, which comes in different types and from different sources, is a complex process (Yaqoob et al. 2016). Given this, organisations are facing a challenging scenario in terms of processing this huge volume of data, due to a greater demand for high-speed processing. Consequently, the development of computational solutions for systems operating in real-time has become a problematic task (Raghavan & Perera 2017). To depict that, systems applied to real-time analytics of streaming data (RTAoSD) can be mentioned, in which high-speed processing and low latency are required. Commonly, the current applications are developed on General Purpose Processors (GPPs), such as Central Processing Units (CPUs) or Graphics Processing Units (GPUs) (Cappellari P. 2017, Ge et al. 2019, C.S.R. Prabhu 2019, Pyne & Rao 2016). However, the processing speed of GPPs hardly keeps up with the growing demand for speed, due to their intrinsically sequential processing and the constant transfer of data between memory and processor; therefore, they are less efficient due to their slow response capacity (Sze 2017, Kung et al. 2019).

A solution that has been widely accepted to meet the demand for high-speed processing (or high throughput) is to devise parallel implementations of the relevant algorithms. Parallel execution allows different sections to operate on different sets of data concurrently (Venkatesh & Arunesh 2019). Besides, hardware implementations combined with parallelisation techniques proposed in the literature have shown satisfactory results when compared to traditional platforms based on sequential solutions (Hussain, Benkrid, Erdogan & Seker 2011a, Choi & So 2014).

The idea now is that the hardware will model itself to the algorithm and not the other way around. Previously, the computational performance of clustering algorithms with a high parallelisation degree was jeopardised as they often had to be adapted to the target hardware. On the other hand, the development of customised hardware dedicated to a specific implementation can overcome that limitation. Among the current alternatives for dedicated hardware, Application-Specific Integrated Circuits (ASICs), systolic array processors (known as Tensor Processing Units (TPUs)) and Reconfigurable Computing (RC) can be highlighted (Fang et al. 2020, Khalifa et al. 2020, da Silva et al. 2020).

RC is an emerging area that leads to the possibility of developing hardware architectures customised to the algorithm, unlike the traditional model where the algorithm adjusts to the instructions of the processor, as in GPPs. Besides, RC has also been used to implement and test architectures that are later developed as ASICs. Usually, RC is performed through Field-Programmable Gate Arrays (FPGAs). The FPGA is an array of reconfigurable logic blocks that allows the implementation of several logic circuits that can operate independently, enabling parallel processing of different data simultaneously (Instruments. 2011).

The use of RC can be found in several proposals in the literature to increase the processing speed of massive data sets. Regarding RC hardware applications of the K-means algorithm, a parallel FPGA implementation proposed by Raghavan & Perera (2017) achieved more than 300× speedup compared to GPPs. Similarly, a hardware accelerator suggested by Canilho et al. (2016) reached 10× speedup against a CPU implementation. Similar results are also shown for SOMs, as can be observed in Lachmair et al. (2017), where a speedup of 220× was achieved.

Therefore, the main objective of the present thesis is to study hardware implementations of data mining algorithms, with emphasis on clustering algorithms, to improve the processing speed of information extraction from Big Data or other applications in general. The impact on throughput and area occupancy is evaluated to illustrate the viability of each technique.

1.1 Motivation

The processing speed of data extraction in Big Data is an essential factor, especially given the exponential growth of the data volume generated daily in some areas (Zhou et al. 2016). According to FFoulkes (2017), the analysis of data has to be fast and, if possible, in real-time, providing insights for making decisions effectively.

According to research conducted by the International Data Corporation (IDC), it is estimated that the volume of digital data has increased by 50× since 2017, resulting in more than 4 zettabytes (FFoulkes 2017). Besides, with the advent of IoT, the growth in the volume of digital data shows no tendency to stagnate. Therefore, it is necessary to have data processing and extraction techniques that keep up with this rapid growth.

More than 75% of organizations already operate in some way with Big Data, and several surveys clarify its advantages in providing the extraction of information that increases the precision of decision making, as described in Siddiqa et al. (2016), Zhou et al. (2016), Oussous et al. (2017), Tu et al. (2017), Akoka et al. (2017) and Jung (2017). As an example, there is decision-making in the manufacturing industry, where the production line goes through a dynamic training and optimisation process, based on data analysis by artificial intelligence algorithms, in real-time. This recent process that has been emerging in the industrial sector is known as Industry 4.0 (Fu et al. 2018).

However, with this volume of data, concerns have arisen regarding processing speed and the complexity of the algorithms used as data mining techniques. Clustering algorithms are performed based on metrics, i.e., the distance metric that defines the nearest cluster for an input data point, further increasing this bottleneck. Therefore, the large volume of data and the complex techniques used in its analysis demand computing solutions that scale up computation efficiently (Nguyen et al. 2019).

In recent research, applications of data clustering algorithms developed in hardware have proved entirely satisfactory for the extraction of information in Big Data, due to their ability to respond quickly when compared to implementations in GPPs, where more than 300× speedup can be achieved (Raghavan & Perera 2017). The FPGA, for example, made it possible to implement these algorithms in parallel, resulting in increased processing speed and, consequently, minimising the execution time even for high computational complexities, as can be seen in the works implemented by Chung & Wang (2017), Neshatpour, Malik, Ghodrat, Sasan & Homayoun (2015), Hussain, Benkrid, Erdogan & Seker (2011b) and Neshatpour, Malik & Homayoun (2015).


Therefore, it is essential to study and analyse the impact of hardware implementations on increasing the processing speed of information extraction from Big Data with clustering algorithms. The proposal of this thesis is to present fully parallel implementations of the K-means and SOM clustering algorithms for applications in several areas, with a focus on high performance, to extract the patterns between data from massive sets in a short time.

The hardware approaches most commonly found in the literature for these algorithms, including parallel implementations, generally have sequential schemes, thus limiting processing speed compared to fully parallel implementations. Therefore, the main contributions of this work are:

• The proposed implementation can assist several fields where there is a massive flow of data and restrictions on processing time, since it reaches high-speed processing, consequently making it possible to process data in real-time for some applications;

• A fully parallel hardware implementation without additional embedded processors or software;

• A thorough speed and area occupation analysis based on the post-synthesis results concerning the hardware;

• Synthesis and analysis of the architecture for three different distance metrics and fixed-point sizes, assisting future implementations in the selection of the best metrics for a specific application;

• A new method for transcribing an algorithm into hardware, through the use of register-transfer level (RTL) implementation.

The Virtex-6 xc6vlx240t-1ff1156 FPGA was used to design the hardware approach proposed in this work for the mentioned algorithms. The results obtained regarding the contributions mentioned above are presented to demonstrate the viability of this thesis proposal. Afterwards, the design can be replicated and/or implemented in ASICs. To validate and evaluate this proposal, it was developed at RTL, allowing the system architecture to be developed in a totally parallel and more straightforward way while keeping track of low-level settings. Also, FPGAs are widely used to implement these algorithms in parallel, as processing time and cost are significantly reduced (Tirumalai et al. 2007a).

1.2 Objective

There are several ways to design hardware approaches for clustering algorithms, and most lie between full and partial parallelisation. With full parallelisation, the design can reach the upper speed limit. The upper and lower limit values for the performance parameters (processing speed and hardware area occupation) depend on the technique and the target algorithm. However, based on the works in the literature, gains between 5 and 1,000× (in the parameters that will be studied) are expected when comparing custom hardware with GPPs (Coutinho et al. 2019, Lopes et al. 2019, Noronha et al. 2019).

Therefore, the main objective of this work is to design a reference dedicated hardware for data clustering algorithms in a completely parallel way to increase the processing speed, since the data present in Big Data are important in supporting the decision-making process of different organizations. The employment of dedicated hardware enhances resource quality and the management of time constraints (Shoeibi et al. 2019).

As previously mentioned, the K-means and SOM algorithms are widely used in several applications and areas due to their ability to extract information in an unsupervised way, their simplicity and their flexibility for parallel implementations, being among the most important clustering algorithms. Hence, they were developed and executed in hardware to observe their performance, mainly regarding throughput and area occupation.

To achieve that, the following steps will be performed:

• The hardware implementation of the K-means and SOM clustering algorithms. An FPGA design is developed to validate and evaluate the proposal;

• Studies regarding the computational complexity of each mentioned algorithm, applied to the analysis of massive data sets, providing a direction in the choice of these algorithms' parameters according to the application;

• The impact on throughput and area occupation resulting from the implementations, providing a feasibility analysis;

• Studies regarding the increase in the processing speed of massive data sets.

The designed hardware approach for the K-means and SOM algorithms is described in Chapter 2 and Chapter 3, respectively. Each chapter presents the description of one of the algorithms, related works in the literature about hardware implementations, and the achieved results concerning the two key metrics: hardware throughput and area occupation.

1.3 CAPES PrInt

This thesis proposal was awarded scholarship funding by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) - Finance Code 001, under the CAPES PrInt program, for a period of 9 months. Hence, part of this research was carried out at the Centre for Data Science at Coventry University, Coventry, United Kingdom.

As a result, an article was published and presented at the IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP), in partnership with Professor Elena Gaura. Besides that, contributions were also provided on ongoing experiments on optimisation of energy distribution in island microgrids. In addition, the partnership strengthened cooperation between the Federal University of Rio Grande do Norte and Coventry University.

1.4 Published papers

The hardware approaches proposed in this thesis have been published by the authors in journals and conferences. The proposed design for the K-means can be found in the IEEE Access journal, and one design of the SOM can be found in the proceedings of the 31st IEEE International Conference on Application-specific Systems, Architectures and Processors. The references for the papers are:

L. A. Dias, J. C. Ferreira and M. A. C. Fernandes, "Parallel Implementation of K-Means Algorithm on FPGA," in IEEE Access, vol. 8, pp. 41071-41084, 2020, doi: 10.1109/ACCESS.2020.2976900.

L. A. Dias, M. G. F. Coutinho, E. Gaura and M. A. C. Fernandes, "A New Hard-ware Approach to Self-Organizing Maps," 2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP), Manchester, United Kingdom, 2020, pp. 205-212, doi: 10.1109/ASAP49362.2020.00041.

1.5 Thesis organization

This work is organised as follows: in this first chapter, an introduction and contextualisation, the motivation, theoretical framework, objective of the proposal and contributions of the work were presented, as well as the published papers. Chapter 2 presents the K-means algorithm, related works in the literature about different hardware designs, a detailed description of the architecture proposed in this work and the results achieved regarding the mentioned objective, while Chapter 3 similarly presents the SOM algorithm. Meanwhile, Chapter 4 presents the study about the quality of the clusters obtained with the proposed hardware approaches. Finally, Chapter 5 gives the discussions and conclusion for the results obtained with this thesis proposal.


Chapter 2

K-means algorithm

In recent years, technological advances in digital devices have resulted in a significant increase in the amount of digital data processed and stored, which in turn is generated in a variety of fields, including health, traffic, climatology, mobile devices, and social networks (Yaqoob et al. 2016, Ayani et al. 2019). Analysing this massive amount of data and extracting relevant information has become an essential process in decision making for various organisations in several areas such as finance, banking, healthcare, and communication (Koseleva & Ropaite 2017). Therefore, as previously mentioned, organisations have been facing a challenging scenario when processing such massive amounts of data due to the increased demand for results in shorter time frames (Raghavan & Perera 2017).

Parallel hardware implementations have been widely adopted to meet the demand for high throughput (Hussain, Benkrid, Erdogan & Seker 2011a, Venkatesh & Arunesh 2019, Choi & So 2014). The K-means algorithm is commonly used in the process of data clustering because it allows finding data patterns based on their similarity, in an unsupervised way; in addition, it is a relatively simple technique and allows a high degree of parallelism in its processes (Witten et al. 2016, Patel & Thakral 2016a, Bahmani et al. 2012).

This chapter is dedicated to the introduction of the K-means algorithm and its hardware implementation, proposed here with general applicability to situations where large amounts of data must be processed under strict time restrictions. The impact of its parameters is analysed regarding two key metrics: throughput and area occupation. Also, the impact of the most common distance metrics adopted for K-means implementations is analysed.

The remainder of this chapter is organised as follows: section 2.1 presents an overview of the most relevant related works found in the literature, while section 2.2 explains the K-means algorithm and its operation. Meanwhile, section 2.3 shows a detailed description of the architecture proposed in this work, while section 2.4 describes the results regarding the mentioned metrics. Finally, section 2.5 gives the discussions about this chapter.

2.1 Related works

In this section, an overview of the most relevant related works found in the literature is presented, with an emphasis on parallel and FPGA implementations whose main objective is to speed up data processing by reducing processing time. In the last years, applications of the K-means algorithm involving parallel and distributed, hardware-only and hybrid (software and hardware) implementations have been reported in the literature.

The K-means algorithm has long been used for applications in the most diverse areas, such as image processing, audio recognition, biomedicine, pharmacy and so on, to find patterns and recognise data. Parallel implementations of the K-means algorithm began to appear in the literature more than a decade ago, as can be seen in Stoffel & Belkoniene (1999). In their proposal, which was implemented in software, the data set is distributed among several computers (PCs) to perform the grouping of data in parallel and, after that, the centroids are updated on a single PC according to the grouping performed on the others. As a consequence of the parallelisation of the algorithm's grouping process, a reduction of the processing time was achieved compared to a completely sequential implementation, reaching an efficiency of up to 90%. Similarly, an implementation also using multiple PCs was proposed by Kraj et al. (2008), reducing the processing time by up to 87%.

Proposals for software implementations are usually based on the use of GPPs such as CPUs and GPUs. However, due to the possibility of parallel processing through their coprocessors, GPUs became the leading platform for developing data clustering algorithms in software. As an example, Farivar et al. (2008) proposed a K-means approach where the similarity and grouping processes were carried out in parallel, while the centroid updating process was performed in a serial manner. An increase in processing speed (speedup) of 13× was achieved compared to a completely sequential implementation.

Similarly, there is the approach proposed by Zechner & Granitzer (2009), where the GPU coprocessors are used as threads, which in turn are responsible for performing the similarity and grouping processes, therefore allowing parallel processing of these steps. In contrast, the centroid updating process is performed sequentially in one dedicated thread. A speedup of 14× was obtained for the parallel implementation over the sequential scheme.

A more recent example is the approach suggested by Saveetha et al. (2018), in which the K-means algorithm is implemented to group genes and find expression patterns among them. A distributed implementation was developed on a CPU and a GPU, in parallel, and compared to the same implementation performed sequentially. The parallel implementation resulted in a speedup of up to 7×.

Similar to Saveetha et al. (2018), in Lutz et al. (2018), an implementation of K-means operating on CPU and GPU was also proposed, but for general purpose applications. Aiming to reduce processing time, the frequency at which data is exchanged between CPU and GPU has been reduced, resulting in a speedup of up to 18.5× compared to implementations where there is a constant data exchange between CPU and GPU.

However, the frequent communication between devices becomes a bottleneck for these distributed implementations, limiting the processing speed (throughput). Besides, sequential processing, such as memory accesses, is inherent to the use of CPUs and GPUs, restricting the speed at which Big Data information can be processed, especially in real-time applications.

Thus, given the need for increasingly faster analysis techniques, there has been an increase in the number of data clustering techniques implemented in hardware, given that with the advent of RC devices, such as FPGAs, processing time and costs have been significantly reduced (Tirumalai et al. 2007a).

One of the first implementations of K-means in FPGA was carried out by Leeser et al. (2000), to analyse the tradeoff between different metrics for calculating similarity. The design was developed for applications in general.

Subsequently, FPGA implementations also emerged for specific applications. An example is the implementation proposed by Saegusa & Maruyama (2007), in which the algorithm was developed to segment the colour of images in real-time. In this proposal, the similarity and grouping processes are carried out in parallel while the centroid updating process is carried out sequentially, similar to the implementations proposed on GPUs.

The parallel implementation of data clustering techniques in hardware is limited by the available area resources, so methods have been proposed seeking to increase the processing speed while maintaining control over area occupation.

In Hussain, Benkrid, Erdogan & Seker (2011a), a parameterised K-means algorithm was fully implemented on an FPGA, a CPU and a GPU, to compare the speedup between each implementation. In the case of the FPGA, the input data points and the centroids are stored in memories, which can be internal or external. To reduce the hardware area occupation, the insertion of input data (present in the data set) into the algorithm and the centroid updating process are implemented serially. On the other hand, to increase the throughput, the distance metric that defines the similarity between the input data point and the clusters is implemented in a parallel scheme. The FPGA implementation achieved a speedup of about 6.7× in comparison to a GPU-based one; the speedup over a CPU-based implementation was 54×. However, in addition to the sequential steps, the constant memory accesses for writing and reading input data points and centroids limit and reduce the performance of the implementation concerning processing speed.

In Choi & So (2014), the K-means algorithm was implemented on different FPGAs using the MapReduce model to compare the performance regarding speedup. The similarity distance metric adopted is based on the Euclidean distance, and the map function performs the clustering process. Therefore, a key-value pair list containing the input data point and its nearest cluster centroid is generated. This list is then sorted based on centroid values by a shuffling function, allowing the data to be grouped according to their clusters. Afterwards, the reduce function is performed in a parallel scheme, updating all centroids simultaneously. This proposal was implemented on two FPGAs dedicated to the map function (called mappers) and one dedicated to the reduce function (called reducer). The FPGAs are Xilinx Kintex-7 XC7k325T devices, and the tests presented were performed for 32 mapper and 12 reducer functions. The proposed design has a speedup of about 20.6× compared to the same implementation in software, despite the bottleneck caused by the communication between the FPGAs. The resource usage for both mappers is 36% of the registers, 69% of the Look-Up Tables (LUTs), 49% of the DSPs and 33% of the block Random-Access Memory (RAM), and for the reducer 44% of the registers, 71% of the LUTs, 34% of the DSPs and 24% of the block RAM. However, the processing speed is restricted by the communication between different FPGAs and the constant access to memory blocks, which, as already mentioned, is sequential. Also, the process of grouping data performed by the map function is intrinsically serial.
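For readers unfamiliar with the MapReduce formulation used in that work, the sketch below shows the map/reduce split of one K-means iteration in plain Python: the map step pairs each point with its nearest centroid, grouping by key plays the role of the shuffle, and the reduce step averages the points of each cluster. It is a schematic software illustration of the dataflow only, not the FPGA design described above, and all names are ours.

```python
import numpy as np

def kmeans_map(points, C):
    """Map step: emit (nearest-centroid index, point) pairs."""
    for p in points:
        k = int(((C - p) ** 2).sum(axis=1).argmin())
        yield k, p

def kmeans_reduce(pairs, C):
    """Shuffle + reduce step: group pairs by centroid index and update each centroid."""
    groups = {}
    for k, p in pairs:                       # grouping by key mimics the shuffle phase
        groups.setdefault(k, []).append(p)
    return np.array([np.mean(groups[k], axis=0) if k in groups else C[k]
                     for k in range(len(C))])

X = np.random.default_rng(0).random((100, 2))   # toy data set
C = X[:3].copy()                                # 3 initial centroids
C = kmeans_reduce(kmeans_map(X, C), C)          # one MapReduce round = one K-means iteration
```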


A hybrid implementation is proposed in Canilho et al. (2016) to compare the speedup against an ARM processor. A hybrid implementation executes only part of the algorithm steps in hardware and the rest in software. This hybrid proposal aims to reduce FPGA area overhead and improve processing speed. The FPGA contains the circuits to calculate the similarity distance metric, which is based on the Manhattan distance, and the circuits to group input data points with the nearest centroid. The software is responsible for updating the centroid values, avoiding the creation of division circuits, needed for this process, on the FPGA, thus reducing the area overhead. A 32-bit floating-point representation for input data points is also used to provide high resolution. The resulting area overhead for the FPGA is 11,560 LUTs, 11,171 registers, 10 RAM blocks (BRAM) and 28 DSPs. A speedup of about 10× over the software version running on the ARM processor is achieved. However, the area overhead is relatively high, considering that only part of the K-means algorithm runs on the FPGA. Besides, a bottleneck caused by the communication between the FPGA and the ARM processor limits the processing time.

In the work presented in Raghavan & Perera (2017), the K-means algorithm is completely developed on an FPGA and a GPP. External memories are used to store the data set and initial centroid values, and internal block memories are used to store each input data point and centroid for processing. This implementation allows several input data points to be processed in parallel, but the similarity distance metric to each centroid is obtained sequentially. Therefore, the processing speed is reduced when compared to a fully parallel implementation. The implementation uses 33% of the Virtex-6 xc6vlx240t total resources for a speedup of about 368× over the GPP version. In addition to the sequential steps, there is a communication bottleneck caused by frequent memory accesses.

In the work detailed in Chung & Wang (2017), as in Choi & So (2014), the K-means algorithm is implemented with the aim of accelerating Hadoop clusters. A hybrid architecture is also proposed. The similarity distance metric is implemented on the FPGA using the Euclidean distance, while the centroid updating is implemented in software. The hybrid implementation was chosen to reduce FPGA area overhead. A 4× speedup has been obtained compared to using the Apache Mahout Machine Learning libraries, a distributed linear algebra framework written in Scala to assist in the implementation of the K-means (Mahout 2018).

Similar to Choi & So (2014), the work presented in Li et al. (2016) also developed an implementation based on the MapReduce model. The K-means algorithm is completely developed on a Zynq xc7z045ffg600-2 FPGA and allows input data points to be processed in parallel. The data set is split according to the number of mapper circuits, where each mapper is responsible for processing a data set slice. The mappers are responsible for obtaining the distance metric and assigning input data to the nearest cluster. In addition, one reduce circuit is used to update all the centroids. The implementation achieved a throughput of 28.74 Gbps and occupied 47.61% and 81.51% of the registers and LUTs, respectively.

Therefore, based on those papers, it is clear that the use of hardware to accelerate the processing of massive volumes of data is feasible and effective when compared to general-purpose implementations or sequential systems. However, as mentioned, most implementations have sequential steps and frequent memory accesses, which can limit their use in real-time applications. Hence, this work proposes a parallel implementation of each K-means process, making real-time applications possible and reliable.

2.2 The K-means algorithm

K-means is an unsupervised algorithm for grouping data; that is, data sets are partitioned and grouped based on similarity metrics. A group is called a cluster, and a similarity metric can be an equation that calculates the distance between two points, where the closer they are, the greater the similarity between them. Therefore, this is an iterative algorithm that aims to generate K clusters, assigning data from a data set to the closest representative data, which in turn is called the centroid, according to the distance between them (Patel & Thakral 2016a).

The cluster number, K, is an integer, and to each k-th cluster a centroid, c_k, is assigned. The initial value of each k-th centroid is randomly generated by choosing random input data points from the data set, which is the most commonly used approach, as can be seen in the following papers: Neshatpour et al. (2016), Arora et al. (2016), Hussain, Benkrid, Seker & Erdogan (2011) and Neshatpour, Malik & Homayoun (2015). It can also be generated by other algorithms (Bahmani et al. 2012).

In this thesis, the set of centroids, C[m], is randomly initialised and can be described as

$$C[m](n) = [c_1[m](n), c_2[m](n), \dots, c_K[m](n)] \qquad (2.1)$$

where m represents the number of bits that describe each centroid and n denotes the n-th iteration.

In every n-th iteration of the algorithm, the similarity between a cluster centroid and an input data point, p_j[m], of a data set, X[m], is obtained. A data set, X[m], of J input data points can be represented as

$$X[m] = [p_1[m](n), p_2[m](n), \dots, p_J[m](n)]. \qquad (2.2)$$

The similarity between a k-th cluster, which is represented by its centroid, c_k[m], and a j-th input data point, p_j[m], is defined based on the distance between them. Therefore, a distance metric determines to which centroid the input data point is assigned. Afterwards, the centroid is updated with the mean value of all input data points assigned to it. The process is then repeated, but the distance is now obtained with regard to the new (updated) centroid value, until the centroid values no longer change or a predefined number of iterations has been performed.

Algorithm 1 presents the K-means pseudocode. This code details all the variables and procedures that will be used in the implementation presented in the following sections. It starts by randomly generating the first set of centroids, C[m], as shown in equation 2.1; therefore, one centroid of m bits is generated for each k-th cluster, as shown in line 1.

As can be seen from line 3, at each n-th iteration, the distance of every j-th input data point, p_j[m], in relation to each k-th centroid, c_k[m], is calculated.


Algorithm 1 K-means pseudocode

 1: Initialise C[m] centroids randomly;
 2: while C[m](n + 1) ≠ C[m](n) do
 3:   for j ← 1 to J do
 4:     for k ← 1 to K do
 5:       Compute the distance d_k(p_j[m], c_k[m])(n) according to equation 2.3 or equation 2.4;
 6:       d(n) ← d_k(p_j[m], c_k[m])(n);
 7:     end for
 8:     for k ← 1 to K do
 9:       if d(k − 1) ≤ d(k) then
10:         c_k ← p_j[m];
11:       end if
12:     end for
13:     Update c_k according to equation 2.6;
14:   end for
15:   n ← n + 1;
16: end while

This distance is obtained through a distance equation, defined as follows:

$$d_k(p_j, c_k) = \left( \sum_{i=1}^{D} \left\| p_{j,i}[m] - c_{k,i}[m] \right\|^{r} \right)^{1/r} \qquad (2.3)$$

where D represents the number of data dimensions/attributes and r defines which distance metric is used: for r = 1 the Manhattan distance is obtained, and for r = 2 the Euclidean distance.

It is also common to adopt the Squared Euclidean distance, to avoid the complexity of the square root function required by the Euclidean distance (r = 2) while maintaining accurate results compared to the Manhattan distance. The Squared Euclidean distance is derived from the Euclidean distance and is defined as follows:

$$d_k(p_j, c_k)^2 = \sum_{i=1}^{D} \left\| p_{j,i}[m] - c_{k,i}[m] \right\|^{2}. \qquad (2.4)$$

At each n-th iteration, a vector of distances, d(n), stores the distance of a j-th input data point, p_j[m], in relation to each centroid, c_k[m], and is represented as

$$d(n) = [d_1(p_j[m], c_1[m])(n), d_2(p_j[m], c_2[m])(n), \dots, d_K(p_j[m], c_K[m])(n)] \qquad (2.5)$$

where d_k is the k-th distance obtained according to equation 2.3 or equation 2.4.
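As a small worked example of equations 2.3 to 2.5 (with values chosen here purely for illustration, not taken from the thesis experiments), consider D = 2, an input data point p_j = (0.2, 0.7) and K = 2 centroids c_1 = (0.1, 0.6) and c_2 = (0.5, 0.2):

```latex
% Worked example of equations 2.3-2.5 with illustrative values
\begin{aligned}
d_1(p_j, c_1) &= |0.2 - 0.1| + |0.7 - 0.6| = 0.2        &&\text{(Manhattan, } r = 1\text{)}\\
d_1(p_j, c_1) &= \sqrt{(0.1)^2 + (0.1)^2} \approx 0.141 &&\text{(Euclidean, } r = 2\text{)}\\
d_1(p_j, c_1)^2 &= (0.1)^2 + (0.1)^2 = 0.02             &&\text{(Squared Euclidean, eq. 2.4)}\\
d_2(p_j, c_2) &= |0.2 - 0.5| + |0.7 - 0.2| = 0.8        &&\text{(Manhattan, } r = 1\text{)}\\
d(n) &= [\,0.2,\ 0.8\,]                                 &&\text{(Manhattan distance vector, eq. 2.5)}
\end{aligned}
```

With any of the three metrics, the point would be assigned to cluster 1, since d_1 < d_2.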

After calculating the distances, the input data point, p_j[m], is assigned to the k-th cluster with the closest centroid, c_k[m]; in other words, the input data point is assigned to the cluster centroid with the minimum distance present in the vector of distances, d(n), as can be seen in lines 8 to 12 of Algorithm 1. According to Patel & Thakral (2016a) and Kakushadze & Yu (2017), the most used distance metric is the Euclidean distance, shown in equation 2.3 for r = 2, because it provides more accuracy compared to the Manhattan distance (Estlick et al. 2001).

Lastly, each centroid, c_k[m], present in the set of centroids, C[m], is updated with the mean value of all input data points assigned to it, according to the following equation

$$c_k[m] = \frac{1}{Z} \sum_{w=1}^{Z} p_{j,w}[m] \qquad (2.6)$$

where Z represents the total number of input data points in that cluster, that is, assigned to this centroid. The process is then repeated in the next n-th iteration if the new centroid values, C[m](n + 1), are different from the current values, C[m](n), as can be seen in line 2. It is noticeable in Algorithm 1 that the larger the data set and the number of centroids, the greater the number of iterations. Thus, when applied to massive data sets, the number of iterations is very high, resulting in high computational complexity, mainly due to the calculation of the distance metric. Hence, the need for high-speed processing is clear.
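For reference, a minimal software version of Algorithm 1 is sketched below, following the same steps: random initialisation, assignment of every point to the nearest centroid, mean update (equation 2.6), and stopping when the centroids no longer change or a maximum number of iterations is reached. It is only a sequential model intended to clarify the procedure that the hardware parallelises; the function and variable names are illustrative, and the Squared Euclidean metric (equation 2.4) is assumed.

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Minimal K-means following Algorithm 1 with the Squared Euclidean metric.

    X is a (J, D) array of input data points; returns the centroids C (K, D)
    and the cluster label of each point.
    """
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=K, replace=False)]      # line 1: random initial centroids
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):                             # line 2: repeat until convergence
        # lines 3-12: distance of every point to every centroid (equation 2.4),
        # then assignment to the nearest one
        d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)   # (J, K) distances
        labels = d.argmin(axis=1)
        # line 13: update each centroid with the mean of its points (equation 2.6)
        new_C = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else C[k]
                          for k in range(K)])
        if np.allclose(new_C, C):                          # centroids stopped changing
            break
        C = new_C
    return C, labels
```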

2.3 Implementation Description

The entire K-means algorithm, presented in Algorithm 1, was developed using a parallel architecture focused on accelerating the data processing speed regardless of the data set, taking advantage of the available FPGA hardware resources, similarly to Nedjah & de Macedo Mourelle (2007a). Figure 2.1 shows the general architecture of this proposal. The figure details, in a block diagram, the main modules of the proposed implementation, which were encapsulated in order to make the general visualisation of the architecture less complex.

The algorithm flowpath of this implementation is organised in four main modules, as shown in Figure 2.1, where each of them represents a K-means process. Firstly, the Centroid Register (CR) module stores every cluster centroid, c_k[m], of the set, C[m]. The Distance Metric (DM) module is responsible for defining the similarity by calculating the distance between each j-th input data point, p_j[m], and each k-th centroid, c_k[m], while the Clustering Process (CP) module defines to which centroid the input data point is assigned. Lastly, the Mean Centroid (MC) module updates the value of each k-th centroid in the set, C[m].

To be fully implemented in parallel, the architecture proposed here is replicated according to a parallelisation degree, called here g. This parameter allows a total of G input data points, p^g_j[m], to be entered simultaneously, wherein g = 1, 2, ..., G, as can be observed in Figure 2.1. In order to process these G input data points, the DM and CP modules are replicated G times. Therefore, the algorithm flowpath is executed for G different input data points simultaneously. It is important to emphasize that this implementation can be replicated for several input data points to be processed in parallel. In addition, note that the implementation is scalable; in other words, every circuit and parameter of the implementation can be replicated, limited only by the resources of the used hardware. Hence, this proposal can be used to process any Big Data, that is, any number of data dimensions/attributes, cluster centroids, etc. required for a data set, according to the resources available in the hardware.

Figure 2.1: General architecture of the proposed parallel K-means algorithm implementation.

The initial centroid values present in the set, C[m], randomly chosen according to Algorithm 1, are generated outside the FPGA and stored in the CR module through Gigabit Ethernet, and then updated with the mean value of the nearby input data points. This update process occurs at every n-th iteration. When the centroid values do not change, the algorithm stops running, indicating that all input data points are assigned to their respective clusters. The runtime of the algorithm can also be stopped after a predetermined number of iterations.
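To illustrate how the parallelisation degree g and the modules of Figure 2.1 interact over one iteration, the sketch below is a purely behavioural software model (not the RTL): G points enter per step, each DM/CP lane assigns its point independently, and a single MC stage accumulates the sums used for the mean update. The function name and the choice of the Manhattan metric are ours, for illustration only.

```python
import numpy as np

def one_iteration(X, C, G=4):
    """Behavioural model of one n-th iteration of the architecture in Figure 2.1.

    X: (J, D) data set, C: (K, D) current centroids, G: parallelisation degree.
    Returns the centroids produced by the MC stage for the next iteration.
    """
    K, D = C.shape
    acc = np.zeros((K, D))                     # per-cluster accumulators inside MC
    count = np.zeros(K, dtype=int)
    for start in range(0, len(X), G):          # G input data points enter per step
        for p in X[start:start + G]:           # each DM/CP lane handles its own point
            d = np.abs(p - C).sum(axis=1)      # DM: Manhattan distance to all K centroids
            k = int(d.argmin())                # CP: nearest centroid wins
            acc[k] += p                        # MC: accumulate for the mean update
            count[k] += 1
    # MC: mean update (equation 2.6); clusters with no points keep the old centroid
    safe = np.maximum(count[:, None], 1)
    return np.where(count[:, None] > 0, acc / safe, C)
```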

The modules shown in Figure 2.1 are made up of submodules, each with its own parallel implementation, which will be detailed in the next subsections.

2.3.1 Distance Metric Module

A Distance Metric (DM) module has the purpose of calculating the distance of a j-th D-dimensional input data point, p^g_{j,D}[m], to each k-th centroid, c_{k,D}[m], present in the set, C[m], at each n-th iteration, to indicate their similarity. That is the first K-means process performed after initialising the centroids.

It is shown in Figure 2.2, how each g-th DM module is built. In order to calcu-late that distance metric, according to equations 2.3 and 2.4 mentioned in section 2.2, this module is composed by the following submodules: subtractors (SU Bk,D),

multipli-ers (MU LTk,D), absolute (ABSk,D), adders (SU Mk,D), square root functions (SQRTk) and

multiplexers (MU Xk). As the purpose of this proposal is a completely parallel

implement-ation, in addition to replicating this module G times, to obtain the similarity for G input data points simultaneously, its submodules are also replicated to obtain the distance of

(35)

2.3. IMPLEMENTATION DESCRIPTION 17

a j-th input data point, pgj[m](n), to each k-th centroid, ck[m](n), in parallel, this is, in

only one iteration. Hence, these submodules are replicated according to the number of centroids and dimensions, as can be seen in Figure 2.2.

( [m], [m])(n) d1 pgj c1 MUX1 ... ... CS1 SUB1,1 MULT1,1 SUM1 SQRT1 SUB1,D MULT1,D SUBK,1 MULTK,1 SUMK SUBK,D MULTK,D MUXK SQRTK [m](n) c1,1 [m](n) pgj,1 [m](n) pgj,D [m](n) c1,D [m](n) cK,1 [m](n) cK,D ( [m], [m])(n) dK pgj cK ABS1,1 ABS1,D SUM1 ABSK,1 ABSK,D SUMK CSK

Figure 2.2: A g-th distance metric (DM) submodule for a j-th input data point, pgj[m], and K centroids, ck[m].

Firstly, for each data dimension/attribute, D, the submodule SUBk,D subtracts a centroid value, ck,D[m], from an input data point value, pgj,D[m]. The result is multiplied by itself in the subsequent submodule MULTk,D, and its absolute value is also obtained in the ABSk,D submodule. The MULTk,D and ABSk,D values are then summed by the submodules SUMk. Lastly, the submodule MUXk is used to define which equation should be adopted. As can be observed in Figure 2.2, according to the position of the mux data selector (CSk), the Manhattan distance, defined in equation 2.3 for r = 1, the Euclidean distance, defined in equation 2.3 for r = 2, or the Squared Euclidean distance, defined in equation 2.4, is performed. The submodules ABSk,D and SUMk are used to perform the Manhattan distance, the Euclidean distance is performed through the SQRTk submodule path, and the submodules MULTk,D and SUMk are used for the Squared Euclidean distance.
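As a software illustration of what each g-th DM module computes within a single hardware iteration, the sketch below evaluates the three selectable metrics for one input data point against all K centroids. It is a minimal sketch only; the function name and the string selector standing in for the CS mux are illustrative assumptions, not part of the hardware description.

```python
import math

def dm_module(p, centroids, metric="euclidean"):
    """Distance of one input point `p` (list of D attributes) to each of K centroids."""
    distances = []
    for c in centroids:                               # one SUB/MULT/ABS lane per (k, d) in hardware
        diffs = [pd - cd for pd, cd in zip(p, c)]     # SUB_{k,d}
        if metric == "manhattan":                     # r = 1: ABS + SUM path
            distances.append(sum(abs(x) for x in diffs))
        elif metric == "squared_euclidean":           # MULT + SUM path
            distances.append(sum(x * x for x in diffs))
        else:                                         # "euclidean": SQRT after the MULT + SUM path
            distances.append(math.sqrt(sum(x * x for x in diffs)))
    return distances
```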

The resultant vector of distances, d(n), shown in equation 2.5, is then obtained for each k-th cluster, in parallel, at each n-th iteration. Hence, K distances, dk(pgj[m], ck[m]), are generated simultaneously for each input data point, pgj[m].

In order to estimate the scalability, the total amount of each submodule necessary to perform this step, shown in Figure 2.2, can be defined by the number of clusters and dimensions. Thus, the number of subtractors, multipliers and absolute value units is defined as

totalSUB,MULT,ABS = K × D (2.7)

while the total number of adders is

totalSUM = (2 × (D − 1)) × K (2.8)

The number of multiplexers and square root function submodules is defined according to the number of centroids, so a total of K MUXk and K SQRTk are created. Note that SUMk is created only if there is more than one dimension/attribute; in other words, this submodule is not necessary for D = 1.
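The bookkeeping below simply evaluates equations 2.7 and 2.8 together with the per-centroid MUX/SQRT note, as a quick way to estimate how the DM module scales with K and D. The function is an illustrative helper, not part of the hardware design.

```python
def dm_submodule_count(K, D):
    """Illustrative count of DM submodules for K centroids and D dimensions."""
    return {
        "SUB":  K * D,                                  # equation 2.7
        "MULT": K * D,
        "ABS":  K * D,
        "SUM":  (2 * (D - 1)) * K if D > 1 else 0,      # equation 2.8; not needed for D = 1
        "SQRT": K,                                      # one per centroid
        "MUX":  K,                                      # one per centroid
    }

# e.g. 4 clusters of 3-dimensional data
print(dm_submodule_count(K=4, D=3))   # SUB/MULT/ABS: 12 each, SUM: 16, SQRT/MUX: 4 each
```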

This module and its submodules are implemented in fixed-point to reduce the number of bits (m) compared to floating-point implementations. The fixed-point implementation of circuits such as the adders, SUMk, and absolute value units, ABSk, increases the total number of output bits by only 1, so their output size has been set to full precision. Meanwhile, the multiplier submodules, MULTk,D, can double the number of bits, so their output size was limited to an increase of only 2 bits. As the input data points used are normalised between 0 and 1, this setup does not result in truncated or unexpectedly rounded values. In addition, in case the Euclidean distance is chosen, the data is converted to floating-point to execute the operation in the SQRTk submodule in only one iteration and then converted back to fixed-point. Thereby, this submodule does not increase the number of bits. Hence, as can be observed in Figure 2.2, the bit-width, m, increases by only 2 bits for the Manhattan distance and 3 bits for the Euclidean and Squared Euclidean distances.
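The toy example below illustrates why limiting the multiplier output growth is harmless for data normalised between 0 and 1: the product of two values in [0, 1) stays in [0, 1), so only fractional precision is affected and the truncation error is bounded by the discarded bits. The fractional widths chosen here are hypothetical and serve only as a sketch.

```python
def quantize(x, frac_bits):
    """Truncate x to a fixed-point value with frac_bits fractional bits."""
    return int(x * (1 << frac_bits)) / (1 << frac_bits)

m = 8                                    # hypothetical fractional width of the inputs
a, b = quantize(0.734, m), quantize(0.281, m)
exact = a * b                            # a full-precision product would need about 2*m fractional bits
limited = quantize(exact, m + 2)         # multiplier output allowed to grow by only 2 bits
print(exact, limited, abs(exact - limited))   # truncation error stays below 2**-(m + 2)
```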

Each k-th distance, dk, is then passed to the next process, the Clustering Process, to determine to which cluster the input data point should be assigned. This DM module was developed for those three different distance metrics in order to analyse the tradeoff between them and the complexity of the SQRTk submodule concerning area consumption and processing speed.

2.3.2 Clustering Process Module

A Clustering Process (CP) module has the purpose of assigning each j-th input data point, pgj[m](n), to the closest k-th cluster centroid, ck[m], at each n-th iteration, based on the distance vector, d(n), shown in equation 2.5 and generated in the DM module. As the number of clusters is predefined, the condition to perform this process is K ≥ 2; otherwise, the entire data set will be in the same cluster and this module is not required.

This assignment task is realized by comparing the distance values, dk(pgj[m], ck[m]), received from the previous module. In order to do that, this module is composed of the following submodules: comparators (COMPk), logical OR gates and multiplexers (MUX). Since this implementation is completely parallel, the CP module is replicated G times to assign G input data points, pgj[m](n), to their respective clusters simultaneously, as can be seen in Figure 2.1. In addition, its submodules are also replicated in order to compare every k-th distance in the vector of distances, d(n), in parallel, obtaining the lowest in only one iteration.



As can be observed in Figure 2.3, at each n-th iteration, each submodule COMPk receives as input all K distance values, dk, related to each k-th centroid, ck[m]. Then it checks whether its k-th distance, that is, the distance present in its first input, is lower than the others by comparing them.

Each k-th COMPk has an output value, represented by vgk, which is a boolean value defined as

vgk(n) = 1, if dk ≤ dj, ∀ j, 1 ≤ j ≤ K, j ≠ k,
         0, otherwise,

where vgk assumes the bit 1 value to indicate that pgj[m](n) is closest to its respective k-th centroid. When pgj[m](n) is not closest to the k-th centroid of COMPk, vgk assumes the bit 0 value.


Figure 2.3: The submodules of a g-th clustering process (CP) module for K centroids.

Thereby, suppose a data set that needs to be grouped into K clusters, where c1[m] to cK[m] are their respective centroids. Considering the distance of a j-th input data point, pgj[m], to the first centroid being the shortest, that is, d1(pgj[m], c1[m]) < dk(pgj[m], ck[m]), ∀ k, 1 ≤ k ≤ K and k ≠ 1, the output of COMP1, vg1, is set to bit 1. Hence, the remaining comparator outputs, vgk, are set to bit 0 through the MUXs and OR gates.

In case an input data point is equally distant from two centroids, the comparator output with the lowest k-th index will be assigned bit 1, and bit 0 will be assigned to the remaining ones. Therefore, at each n-th iteration, each g-th CP module generates K boolean values, vgk, as shown in Figure 2.1, of which just one is set to 1 while the remaining are equal to 0. This helps the next process, Mean Centroid, recognise to which cluster the input data point belongs.
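A minimal software equivalent of one CP module is sketched below: it turns the K distances into the one-hot vector vgk, breaking ties in favour of the lowest index k, as described above. The function name is illustrative.

```python
def cp_module(distances):
    """Produce the one-hot vector v_k^g from the K distances of one input point."""
    v = [0] * len(distances)
    best = min(range(len(distances)), key=lambda k: distances[k])  # min() keeps the lowest index on ties
    v[best] = 1
    return v

print(cp_module([0.7, 0.2, 0.2, 0.9]))   # [0, 1, 0, 0]: tie resolved to the lowest index
```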

The total number of each submodule necessary to define which centroid the input data point is closest to, as shown in Figure 2.3, can be defined based on the number of clusters. Therefore, the number of comparators, COMPk, is equal to K, while the number of MUXs is defined as

totalMUX = K − 1 (2.9)

and the number of logical OR gates, in turn, is defined by

totalgates = K − 2 (2.10)

This module was developed using only logical submodules; thus, it operates with boolean values, which require only one bit and do not increase the total number of bits, unlike the previous module.
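For completeness, the counts given by equations 2.9 and 2.10, plus the K comparators, can be tabulated by the illustrative helper below (valid for K ≥ 2).

```python
def cp_submodule_count(K):
    """Illustrative count of CP submodules for K >= 2 clusters."""
    return {"COMP": K, "MUX": K - 1, "OR": K - 2}   # equations 2.9 and 2.10

print(cp_submodule_count(K=4))   # {'COMP': 4, 'MUX': 3, 'OR': 2}
```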

2.3.3 Mean Centroid Module

The Mean Centroid (MC) module is responsible for updating every centroid present in the set of centroids, C[m], by calculating the mean value of all input data points, pgj[m], assigned to a given centroid, ck[m], at each n-th iteration, as can be seen in line 13 of Algorithm 1.

As can be observed in Figure 2.1, the MC module, unlike the others, is not replicated. It receives as input all G input data points, pgj[m], and each k-th output of every g-th CP module, vgk(n). However, its submodules are replicated in order to update each centroid, ck[m], in parallel, as shown in Figure 2.4. As can be seen, for each k-th centroid of the set C[m], there is an update centroid submodule, UCk. Every k-th submodule is responsible for updating its k-th centroid, ck[m], where each UCk receives a total of G input data points, pgj[m], and the CP outputs, vgk(n), regarding its respective k-th centroid.


Figure 2.4: Mean Centroid (MC) submodules for K centroids.

As mentioned in section 2.2, the mean value is obtained according to equation 2.6, so the update centroid submodule, UCk, consists of the following circuits: adders (called here SUM-Nk and SUM-Dk), accumulators (called here ACC-Nk and ACC-Dk),



a divider (DIVk), a register (REGk), a comparator (COMPk) and multiplexers (MUX), as shown in Figure 2.5. The suffix -N, in the adder and accumulator names, indicates that these circuits are used to generate the numerator of the division, and the suffix -D indicates that they are used to generate the denominator. Hence, the numerator is generated by the SUM-Nk and ACC-Nk circuits, and the denominator is generated by SUM-Dk and ACC-Dk. These circuits are also replicated for each data dimension/attribute, D.

Figure 2.5: Circuits that constitute the k-th update centroid (UCk) submodule.

At each n-th iteration, the values of vgk(n) are summed in the SUM-Dk submodule and accumulated in ACC-Dk, generating the denominator. These values are also used as the data selectors of the first MUXs, controlling the input data points being summed in SUM-Nk and accumulated in ACC-Nk to form the numerator, as can be observed in Figure 2.5.

As mentioned in the previous subsections, when an input data point, pgj[m], is assigned to a centroid, ck[m], the input vgk(n) assumes the bit 1 value. Therefore, if vgk(n) = 1, the g-th input data point at that n-th iteration is summed in SUM-Nk of that k-th centroid. In addition, SUM-Dk sums the G values vgk(n) to determine the number of input data points present in the cluster. The results of the adders are then accumulated by ACC-Nk and ACC-Dk, respectively. Note that for vgk(n) = 0, that is, when there are no input data points close to that centroid, the value in both accumulators does not change, as both adders receive only zeros.

After each j-th input data point present in the data set has been entered into the algorithm, a division operation is performed by the circuit DIVk, based on the values of both accumulators, generating the new centroid value, ck[m](n + 1). This new centroid value is then stored in the register REGk and sent to the Centroid Register module through the last MUX.

If there are no input data points assigned to the centroid, that is, when SUM-Dk = 0, the new value, ck[m](n + 1), is defined by the value previously stored in REGk. That is accomplished through the comparator, COMPk, used as the data selector of the last MUX, which checks whether SUM-Dk ≠ 0 to propagate the division result; otherwise, it propagates REGk, which in turn holds the randomly generated initial centroid, ck[m], as its initial value.
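The behaviour of the MC module over one iteration, including the empty-cluster fallback to the value held in REGk, can be summarised by the software sketch below. It assumes the one-hot vectors produced by the CP modules as input; the function name is illustrative and the sketch is not the hardware description.

```python
def mc_update(points, assignments, old_centroids):
    """One-iteration mean update: each UC_k accumulates a numerator and a denominator."""
    K = len(old_centroids)
    D = len(old_centroids[0])
    new_centroids = []
    for k in range(K):
        num = [0.0] * D                        # ACC-N_k, one accumulator per dimension
        den = 0                                # ACC-D_k
        for p, v in zip(points, assignments):  # v is the one-hot output of a CP module
            if v[k] == 1:
                num = [n + pd for n, pd in zip(num, p)]
                den += 1
        if den == 0:                           # empty cluster: keep the value stored in REG_k
            new_centroids.append(list(old_centroids[k]))
        else:
            new_centroids.append([n / den for n in num])   # DIV_k
    return new_centroids
```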


The numerator and denominator values are internally converted to floating-point in order to perform the division operation in just one iteration and to avoid increasing the number of bits.

The total number of each circuit necessary to perform this step, shown in Figure 2.5, can be defined by the number of clusters, K, and dimensions, D. The number of SUM-Nk, SUM-Dk, ACC-Nk, ACC-Dk, DIVk and REGk circuits is defined as

totalcircuits = K × D (2.11)

while the number of MUXs necessary can be defined as

totalMUX = G × D + D (2.12)

and the number of comparators is equal to the number of clusters, that is, totalCOMP = K.
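As with the previous modules, equations 2.11 and 2.12 and the comparator count can be evaluated by a small illustrative helper to estimate how the MC module scales with K, D and the parallelisation degree G.

```python
def mc_circuit_count(K, D, G):
    """Illustrative count of MC circuits for K clusters, D dimensions and G parallel inputs."""
    return {
        "SUM-N/SUM-D/ACC-N/ACC-D/DIV/REG (each)": K * D,   # equation 2.11
        "MUX": G * D + D,                                  # equation 2.12
        "COMP": K,                                         # totalCOMP = K
    }

print(mc_circuit_count(K=4, D=3, G=8))
```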

2.3.4 Centroid Register Module

The Centroid Register (CR) module is used to store each k-th centroid, ck[m](n), present in the set of centroids, C[m]. Note that, as shown in Figure 2.1, this module is not replicated. As can be seen in Figure 2.6, this module is constituted by a set of m-bit registers, in which there is one register for each dimension/attribute, D, of every k-th centroid. Hence, the number of registers required is defined as

totalREG = D × K. (2.13)


Figure 2.6: Set of registers that constitute the CR module, used to store each of the D dimensions of all K centroids, ck[m](n).

The initial value of each k-th centroid, ck[m](n), plays a fundamental role in the resultant clusters generated by the K-means algorithm. In the tests performed with the proposed implementation, they were randomly generated, as shown in line 1 of Algorithm 1. As mentioned before, these values are generated outside the FPGA and stored in the CR module.
