
NATIONAL AND KAPODISTRIAN UNIVERSITY OF ATHENS

SCHOOL OF SCIENCE

DEPARTMENT OF INFORMATICS AND TELECOMMUNICATION

BSc THESIS

Distributed Implementation of Proclus Algorithm with Hadoop MapReduce

Felina-Sotiria K. Gogou

Petroula D. Stamou

Supervisor: Dimitrios Gounopoulos, Professor

ATHENS JULY 2019


ΕΘΝΙΚΟ ΚΑΙ ΚΑΠΟΔΙΣΤΡΙΑΚΟ ΠΑΝΕΠΙΣΤΗΜΙΟ ΑΘΗΝΩΝ

ΣΧΟΛΗ ΘΕΤΙΚΩΝ ΕΠΙΣΤΗΜΩΝ

ΤΜΗΜΑ ΠΛΗΡΟΦΟΡΙΚΗΣ ΚΑΙ ΤΗΛΕΠΙΚΟΙΝΩΝΙΩΝ

ΠΤΥΧΙΑΚΗ ΕΡΓΑΣΙΑ

Κατανεμημένη υλοποίηση του αλγορίθμου Proclus με Hadoop MapReduce

Φελίνα-Σωτηρία Κ. Γώγου

Πετρούλα Δ. Στάμου

Επιβλέπων: Δημήτριος Γουνόπουλος, Καθηγητής

ΑΘΗΝΑ ΙΟΥΛΙΟΣ 2019


BSc THESIS

Distributed Implementation of Proclus Algorithm with Hadoop MapReduce

Felina-Sotiria K. Gogou S.N.: 1115201300241

Petroula D. Stamou S.N.: 1115201000183

Supervisor: Dimitrios Gounopoulos, Professor


ΠΤΥΧΙΑΚΗ ΕΡΓΑΣΙΑ

Κατανεμημένη υλοποίηση του αλγορίθμου Proclus με Hadoop MapReduce

Φελίνα-Σωτηρία Κ. Γώγου Α.Μ.: 1115201300241

Πετρούλα Δ. Στάμου Α.Μ.: 1115201000183

Επιβλέπων: Δημήτριος Γουνόπουλος, Καθηγητής


ABSTRACT

PROCLUS [1] is an algorithm that detects clusters in small projected subspaces for data of high dimensionality. It samples the data, selects a set of k medoids and iteratively improves their clustering. The algorithm uses a three-phase approach consisting of initialization, iteration, and cluster refinement.

In this thesis, we implement this serial algorithm in a distributed way. In general, algorithms vary significantly in how parallelizable they can become, ranging from easily parallelizable to completely non-parallelizable. In our case, we used Hadoop MapReduce, which is a software framework capable of processing vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. Finally, we report on several experiments with different datasets in order to show the scalability and efficiency of the algorithm.

SUBJECT AREA: Data mining

Keywords: Hadoop, MapReduce, parallel, high dimension, clustering


ΠΕΡΙΛΗΨΗ

Ο PROCLUS [1] είναι ένας αλγόριθμος που ανιχνεύει συστάδες σε μικρούς προβαλλόμενους υποχώρους για δεδομένα μεγάλων διαστάσεων. Λαμβάνει δείγμα από τα δεδομένα, επιλέγει ένα σύνολο k medoids και βελτιώνει διαδοχικά την ομαδοποίησή τους. Ο αλγόριθμος χρησιμοποιεί μια τριφασική προσέγγιση που αποτελείται από την αρχικοποίηση, την επανάληψη και τη βελτίωση των συστάδων.

Σε αυτή τη διατριβή, εφαρμόζουμε αυτόν τον σειριακό αλγόριθμο με κατανεμημένο τρόπο.

Γενικά, οι αλγόριθμοι διαφέρουν σημαντικά όσον αφορά το πόσο μπορούν να υλοποιηθούν με παράλληλο προγραμματισμό, κυμαινόμενοι από εύκολα παραλληλοποιήσιμους έως εντελώς μη παραλληλοποιήσιμους. Στην περίπτωσή μας, χρησιμοποιήσαμε το Hadoop MapReduce, το οποίο είναι ένα πλαίσιο λογισμικού ικανό να επεξεργαστεί τεράστιες ποσότητες δεδομένων παράλληλα σε μεγάλες συστάδες αποτελούμενες από υλικό βασικού εξοπλισμού με αξιόπιστο και ανεκτικό σε σφάλματα τρόπο. Τέλος, παρουσιάζουμε μια αναφορά για διάφορα πειράματα με διαφορετικά σύνολα δεδομένων, για να δείξουμε την κλιμακωσιμότητα και την αποδοτικότητα του αλγορίθμου.

ΘΕΜΑΤΙΚΗ ΠΕΡΙΟΧΗ: Εξόρυξη δεδομένων

ΛΕΞΕΙΣ ΚΛΕΙΔΙΑ: Hadoop, MapReduce, παραλληλισμός, υψηλές διαστάσεις, ομαδοποίηση


CONTENTS

PREFACE

1. INTRODUCTION

2. HADOOP MAPREDUCE

3. PROCLUS ALGORITHM

4. PROCLUS WITH MAPREDUCE

5. INSTALLATION

6. CODE

7. EXPERIMENTS AND RESULTS

8. CONCLUSIONS

REFERENCES


LIST OF FIGURES

Figure 1 Traditional Systems (RDBMS) and Hadoop approach

Figure 2 The basic model for MapReduce derives from the map and reduce concept

Figure 3 MapReduce execution overview [3]

Figure 4 Proclus flowchart

Figure 5 Proclus algorithm

Figure 6 Proclus MapReduce jobs

Figure 7 A comparison of how containers and virtual machines are organized

Figure 8 Proclus Initialization phase code

Figure 9 Confusion matrix of DIM-sets dataset with 32 dimensions


LIST OF TABLES

Table 1 DIM-sets dataset properties

Table 2 G2-sets dataset properties

Table 3 MNIST Handwritten Digits dataset properties

Table 4 Accuracy range for each dataset


PREFACE

Before you lies our thesis, the basis of which is a distributed approach to the Proclus algorithm.

It has been written to fulfill the graduation requirements of the Bachelor program of the Department of Informatics and Telecommunications of the National and Kapodistrian University of Athens.

We would like to thank our supervisor for his excellent guidance and support during this process.


1. INTRODUCTION

Huge volumes of data are collected and need to be processed on a daily basis. For example, according to the sixth edition of DOMO's report [5], over 2.5 quintillion bytes of data are created every single day, and this figure is only going to grow. By 2020, it is estimated that 1.7 MB of data will be created every second for every person on earth. Therefore, data processing and data mining cannot be executed locally on one machine but have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data and handle failures conspire to obscure the original simple computation with large amounts of complex code dealing with those issues. As a reaction to this complexity, a simple yet powerful execution engine was created, MapReduce (Figure 2), which can be complemented with other data storage and management components to fit anyone's needs. The major difference between the traditional approach and MapReduce is shown in Figure 1.

One of the primary data mining tasks is clustering, which aims at partitioning the data into groups of similar objects, called clusters. The similarity between objects is often determined using distance measures over various dimensions of the data [6, 7]. Subspace clustering is an extension of traditional clustering that seeks to find clusters in different subspaces within a dataset. This method has been investigated extensively, since traditional clustering algorithms consider all of the dimensions of an input dataset in an attempt to learn as much as possible about each instance described. In high-dimensional spaces there are many problems that do not occur in lower-dimensional ones. For example, not all dimensions may be relevant to a given cluster, and including them can reduce accuracy. Moreover, there is the curse of dimensionality, first introduced by Bellman [8], which indicates that the number of samples needed to estimate an arbitrary function with a given level of accuracy grows exponentially with the dimensionality of the function. Finally, the Hughes phenomenon [9] points out that, for a fixed number of training samples, there is an optimal number of features, and adding more can degrade performance rather than increase it.

Proclus is a subspace clustering algorithm that addresses the problems of high dimensionality by selecting a subset of dimensions specific to each cluster. When Proclus was created, data volumes were nowhere near today's, so the algorithm was sequential, with an iteration phase that processes the whole dataset. Thus, the challenges that had to be overcome were to parallelize a sequential algorithm such as Proclus and then express it in the MapReduce framework, a framework that imposes many restrictions. Finally, we had to find a way to distribute the data efficiently, so that the I/O and network cost among processing nodes could be minimized.


Figure 1 Traditional Systems (RDBMS) and Hadoop approach


2. HADOOP MAPREDUCE

Hadoop MapReduce [3] is a software framework and programming model which was created to facilitate the processing of big data sets in a parallel, distributed and easily scalable way. The model enables concurrent processing by using a split-apply-combine strategy for data analysis.

MapReduce is composed of two separate and distinct tasks, the mapper and the reducer. The map functions are distributed across multiple machines by automatically partitioning the input data. They process a key/value pair to generate a set of intermediate key/value pairs, while the reducer merges all the intermediate values that are associated with the same intermediate key (Figure 2). The MapReduce framework is not only able to orchestrate the data transfer in a way that minimizes network overhead between the various parts of the system, but also provides reliability and fault tolerance. This makes it resilient to large-scale worker or master failures, even though it runs on commodity hardware. Below is an example of the execution overview of a MapReduce program with all its actions and their sequence of occurrence (Figure 3).
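
As a minimal illustration of this division of labour (a sketch in the style of the mrjob library used later in this thesis, not part of our Proclus code), a word-count job can be written as follows: the mapper emits one intermediate (word, 1) pair per word, and the reducer sums the values grouped under each key.

from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # one intermediate key/value pair per word in the input line
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # all values sharing the same intermediate key arrive together
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()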

Figure 2 The basic model for MapReduce derives from the map and reduce concept


Figure 3 MapReduce execution overview [3]


3. PROCLUS ALGORITHM

Proclus [1] (PROjected CLUStering) is a variation of the k-medoid algorithm for subspace clustering.

The algorithm (Figure 5) requires the user to input the number of clusters that are going to be found (k), as well as the average number of dimensions per cluster (l). The output is a (k+1)-way partition {C1, …, Ck, O} of the data, such that the points in each partition element, except the last, form a cluster (the last element consists of the outliers). Furthermore, the output provides a possibly different subset Di of dimensions for each cluster Ci.

The algorithm consists of three phases: the initialization phase, the iteration phase and the refinement phase. The purpose of the initialization phase is to choose a random set of candidate medoids by applying a greedy technique to a sample of the original data set. In our case, we use the greedy algorithm of T. Gonzalez [2]. In the second phase, the algorithm progressively improves the quality of the medoids by iteratively replacing the bad medoids with new ones. It also computes a set of dimensions corresponding to each medoid, so that the points assigned to the medoids best form a cluster in the subspace determined by those dimensions. Points are assigned to medoids based on Manhattan segmental distances relative to these sets of dimensions.
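
For reference, the Manhattan segmental distance between two points relative to a set of dimensions D is the Manhattan distance restricted to the dimensions in D, averaged over |D| [1]. A small illustrative helper (hypothetical, not the thesis code) makes this concrete:

def manhattan_segmental_distance(x, y, dims):
    # Manhattan distance restricted to the dimensions in dims, averaged over |dims|
    return sum(abs(x[d] - y[d]) for d in dims) / len(dims)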

Finally, phase three computes new dimensions associated with each medoid and reassigns the points to the medoids relative to the new sets of dimensions. This is achieved in a cluster refinement pass, in which the data is passed over once more in order to improve the quality of the clusters.

As shown in Figure 4, Proclus is an iterative algorithm that computes the clusters and the corresponding dimensions in every iteration. This phase is the one that makes the algorithm hard to parallelize, due to the network communication overhead it creates.

Figure 4 Proclus flowchart


Figure 5 Proclus algorithm


4. PROCLUS WITH MAPREDUCE

Proclus, as already mentioned, consists of three phases. In the initialization phase, the serial algorithm computes the random medoids with a greedy technique. We parallelize this phase by reducing the size of the dataset in the mapper and then calling the greedy algorithm.
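
The greedy technique of Gonzalez [2] is a farthest-first traversal: starting from an arbitrary point, it repeatedly adds the point that lies farthest from the set already chosen. A minimal sketch of that step, with hypothetical helper names and an arbitrary distance function, could look like this:

import random

def greedy_candidates(points, m, distance):
    # farthest-first traversal: each new pick maximizes the distance
    # to its closest already-chosen point
    chosen = [random.choice(points)]
    nearest = [distance(p, chosen[0]) for p in points]
    while len(chosen) < m:
        idx = max(range(len(points)), key=nearest.__getitem__)
        chosen.append(points[idx])
        nearest = [min(nearest[i], distance(p, points[idx]))
                   for i, p in enumerate(points)]
    return chosen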

In order to reduce the amount of data that had to be moved over the network, we used a combiner optimization in the Initialization phase of Proclus. A combiner is an optional, localized reducer that can group data in the map phase. It takes the intermediate keys from the mapper and applies a user-provided method to aggregate values within the small scope of that one mapper. Combiners, in general, can provide substantial performance gains with no real downside. However, there is no guarantee that they are going to be executed, so they cannot carry logic that the overall algorithm depends on. With our roundabout approach, we reuse the reducer implementation as the combiner, so even if a combiner does not execute over some rows of data, those rows will still be accounted for in the reduce phase.
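
In mrjob terms, this pattern amounts to registering the same aggregation method as both the combiner and the reducer. The sketch below (illustrative class name, bucket keys and sampling bound, not our actual Initialization job) shows the idea: because every kept point is re-emitted unchanged, it makes no difference whether the method runs zero, one or two times on the way to the final output.

import random

from mrjob.job import MRJob

class MRLocalSample(MRJob):
    """Illustrative sketch only: shrink the dataset before the greedy step,
    letting the reducer logic double as the combiner."""

    SAMPLE_PER_BUCKET = 50  # hypothetical sampling bound

    def mapper(self, _, line):
        point = [float(v) for v in line.split(',')]
        # scatter points into a fixed number of buckets
        yield random.randrange(100), point

    def _downsample(self, bucket, points):
        # keep at most SAMPLE_PER_BUCKET points per bucket; each kept point
        # is re-emitted unchanged, so combiner and reducer can share this code
        for i, point in enumerate(points):
            if i >= self.SAMPLE_PER_BUCKET:
                break
            yield bucket, point

    def combiner(self, bucket, points):
        yield from self._downsample(bucket, points)

    def reducer(self, bucket, points):
        yield from self._downsample(bucket, points)

if __name__ == '__main__':
    MRLocalSample.run()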

In the Iterative phase, we divide the algorithm into two major jobs. The first one, called FindDimensions, is used to find the most important dimensions for every cluster. It starts by assigning the points of the dataset to medoids. Each point can be assigned to more than one medoid, depending on whether the point is inside the medoid's radius or not. Specifically, a point belongs to a medoid if the distance between the medoid and the point is smaller than or equal to the medoid's delta distance, where the delta distance of a medoid is the minimum distance between that medoid and any other medoid. We implemented this by calculating the delta distance δi for each medoid mi in the initialization phase of the mappers. Then, each mapper reads the input data line by line and finds the set of points that are within distance δi from mi. Finally, the reducer calculates the correlation between the dimensions and the medoids and chooses the dimensions that are the most important for each medoid, with the only restriction being that at least two dimensions must be chosen for every medoid.
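
The locality computation described above can be sketched with two plain helpers (hypothetical names; in the actual job the first would run once when a mapper initializes and the second once per input line):

def delta_distances(medoids, distance):
    # delta_i = distance from medoid i to its nearest other medoid
    return [min(distance(m, other) for other in medoids if other is not m)
            for m in medoids]

def emit_localities(point, medoids, deltas, distance):
    # a point may fall inside the locality of several medoids
    for i, m in enumerate(medoids):
        if distance(point, m) <= deltas[i]:
            yield i, point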

The second job of the iterative phase is to assign points to the clusters using the sets of dimensions that were chosen as the best for them. Then, each cluster is evaluated based on its assigned data points, in order to find the bad medoids that are going to be replaced in the next iteration. In the serial version of this algorithm, the assignment of the points and the evaluation are two different operations; however, we managed to reduce them to a single MapReduce job, which has the benefit of reducing the network overhead since we avoid transferring the dataset more than once. More specifically, since the assignment of points to clusters does not require processing more than one data point at a time, the mappers are able to assign each data point they receive to the appropriate cluster. Then, we use two reducers, one after the other. The first one calculates the centroid of each cluster and the second, which takes as input the output of the first one, calculates the overall evaluation of the clusters, based on their centroids and their data points.
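
A compact sketch of this single-job formulation, written with mrjob's MRStep and using hypothetical constants in place of the medoids and dimension sets that the real job receives from the FindDimensions output, could look as follows. The first step assigns points and computes centroids; the second step, which has only a reducer, scores each cluster by the average segmental distance of its points to the centroid.

from mrjob.job import MRJob
from mrjob.step import MRStep

def msd(x, y, dims):
    # Manhattan segmental distance over the dimension set dims
    return sum(abs(x[d] - y[d]) for d in dims) / len(dims)

class MRAssignAndEvaluate(MRJob):
    """Sketch only: MEDOIDS and DIMS are illustrative placeholders for the
    output of the FindDimensions job."""

    MEDOIDS = [[0.0, 0.0, 0.0], [5.0, 5.0, 5.0]]
    DIMS = [[0, 1], [1, 2]]

    def mapper_assign(self, _, line):
        point = [float(v) for v in line.split(',')]
        # each point is handled independently, so assignment parallelizes freely
        best = min(range(len(self.MEDOIDS)),
                   key=lambda i: msd(point, self.MEDOIDS[i], self.DIMS[i]))
        yield best, point

    def reducer_centroid(self, cluster, points):
        pts = list(points)
        centroid = [sum(p[d] for p in pts) / len(pts) for d in range(len(pts[0]))]
        yield cluster, [centroid, pts]

    def reducer_evaluate(self, cluster, values):
        # average segmental distance of the cluster's points to its centroid
        for centroid, pts in values:
            spread = sum(msd(p, centroid, self.DIMS[cluster]) for p in pts) / len(pts)
            yield cluster, spread

    def steps(self):
        return [MRStep(mapper=self.mapper_assign, reducer=self.reducer_centroid),
                MRStep(reducer=self.reducer_evaluate)]

if __name__ == '__main__':
    MRAssignAndEvaluate.run()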

In the last stage of the algorithm, the refinement phase, we find once again the correlated dimensions for every cluster and assign points to them. This task is done in the same way as previously described in the iterative phase and is executed for the last time. Then, the algorithm ends by storing its results into different files. The output files contain the medoids of each cluster and its points, as well as the centroids.

Figure 6 Proclus MapReduce jobs


5. INSTALLATION

In order to avoid the arduous task of creating a working environment with Hadoop, but also to avoid the limitation of being able to run only a small number of VMs on one computer, we chose to use Docker containers. A Docker container image is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries and settings. In other words, a container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another. Containers are an abstraction at the app layer that packages code and dependencies together. Another aspect of Docker is that multiple containers can run on the same machine and share the OS kernel with other containers, each running as an isolated process in user space. They also take up less space than VMs (container images are typically tens of MBs in size), can handle more applications, and require fewer VMs and operating systems. Container images become containers at runtime; in the case of Docker, images become containers when they run on Docker Engine. Available for both Linux and Windows-based applications, containerized software will always run the same, regardless of the infrastructure [15].

On the other hand, virtual machines (VMs) are an abstraction of physical hardware, turning one server into many servers. The hypervisor allows multiple VMs to run on a single machine. Each VM includes a full copy of an operating system, the application, and the necessary binaries and libraries, taking up tens of GBs. All in all, container technology allows for a much larger scale of applications in virtualized environments, because of the efficiency of virtualizing the OS compared to traditional VMs. The difference between Docker container and VM technology can be seen in Figure 7.

In our case, we installed Docker and used a preexisting image from the Docker public registry [16] as our base image. Although the image was set up and ready to be used, we had to add the Mrjob library in order to have a fully functional working environment. This way we could test our implementation of Proclus with Hadoop MapReduce by creating as many containers as we needed.

Figure 7 A comparison of how containers and virtual machines are organized


6. CODE

Mrjob is a Python package that helps you write and run Hadoop Streaming jobs. We used it to develop our distributed approach to Proclus, given that Python is a high-level programming language that helps you focus on what you want to do rather than how to do it. Mrjob takes advantage of Hadoop Streaming, which is one of the most important utilities in the Hadoop distribution. The Streaming interface of Hadoop allows you to write MapReduce programs in any language of your choice, so it can be seen as a generic Hadoop API. With this API, mappers receive input on STDIN and emit output on STDOUT. In conclusion, even though Hadoop is written in Java and Java is the common language for MapReduce jobs, the Streaming API gives us the flexibility to write MapReduce jobs in Python.
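
To make the STDIN/STDOUT contract concrete, a bare Hadoop Streaming mapper written in Python (independent of Mrjob and shown only for illustration) is just a filter that reads records from standard input and prints tab-separated key/value pairs:

#!/usr/bin/env python
import sys

# read one record per line from STDIN and emit "key<TAB>value" lines on STDOUT;
# Hadoop Streaming groups the emitted keys before calling the reducer script
for line in sys.stdin:
    for word in line.split():
        print('%s\t%d' % (word.lower(), 1))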

Figure 8 shows the code for the Initialization phase of Proclus, and it is a good example of how an algorithm must be simplified because of the constraints imposed on the way distributed algorithms are organized to run over a MapReduce infrastructure. Clearly, MapReduce can be advantageous for processing large quantities of data. However, it still exhibits limitations, mostly due to the fact that the abstractions provided to process data are very simple, and complex problems might require considerable effort to be represented in terms of map and reduce functions only. Some of those limitations made the development of the iteration phase and refinement phase extremely arduous, and it took real ingenuity on our part to accomplish this task.

The rest of our implementation is not included here, to avoid long-windedness and confusing the reader. Nonetheless, our code is available in a public repository on GitHub [13] in case someone wants to take a closer look at our work.


Figure 8 Proclus Initialization phase code


7. EXPERIMENTS AND RESULTS

One of the most challenging aspects of this thesis was finding the right datasets to test the algorithm, since it was of great importance that the data fit the problem we wanted to solve. It would not be useful to have terabytes of data if they were not aligned with the problem.

We tried to find data with features that matter to what we were attempting to cluster, and to discard unrelated features. In other words, sometimes quality is more important than quantity. The difficulty of this task derived from the fact that we not only needed to know the number of clusters a high-dimensional dataset has, but also whether there were any clusters in its subspaces. In other words, we needed a lot more information about the datasets we were going to use than is usually offered.

We evaluated our implementation by directly measuring its accuracy and scalability with datasets that had different characteristics. For example, the first dataset that we selected was taken from the School of Computing of the University of Eastern Finland and is called DIM-sets (Table 1) [12]. We chose it despite its small data size, because it has many different versions with various dimension sizes, ranging from 32 up to 1024 dimensions. However, the distance-based approach of Proclus could lead to poor accuracy, given that using a small number of representative points can cause the algorithm to miss some clusters entirely. As it turned out, Proclus had very good accuracy with this dataset, the lowest being 87.5%.

That happened because we used only a small portion of the original dimensions (Figure 9); to be more exact, we chose 8 of them. A decrease in accuracy occurred every time we made the algorithm use fewer than 8 dimensions. We attribute this to two possible reasons: the first is that the dataset may have very compact clusters with no overlap between them, and the second is that some dimensions had greater weight than the rest, resulting in Proclus choosing them every time over all the others. In the end, this dataset was a great starting point, given that it helped us calibrate the algorithm and measure its efficiency in high dimensionality.

Table 1 DIM-sets dataset properties

Number of Dimensions (l): 32 / 64 / 128 / 256 / 512 / 1024
Number of Clusters (k): 16
Data Points (N): 1024
Data Points Per Cluster: 64

Figure 9 Confusion matrix of DIM-sets dataset with 32 dimensions


The second dataset that we used was again taken from the School of Computing of the University of Eastern Finland [12], and it is called G2-sets. The dataset includes two clusters with Gaussian normal distributions and with various means and deviations. It was picked because it provides a way to test the efficiency of the algorithm when there is overlap between the clusters. In addition, it gave us concrete evidence that our MapReduce implementation can work with a very small number of reducers, given that every cluster needs one reducer to compute its centroid and calculate its evaluation. The algorithm managed to have good accuracy even when the clusters were almost overlapping, but in order to do so we had to increase the number of medoids produced by the Initialization phase.

Table 2 G2-sets dataset properties

Number of Dimensions (l): 128 / 256 / 512 / 1024
Number of Clusters (k): 2
Data Points (N): 2048
Data Points Per Cluster: 1024

Finally, the last dataset that we tested was from the webpage of Prof. Marek Gagolewski of Warsaw University of Technology [14], and it is called MNIST Handwritten Digits. This dataset was used to test the scalability of our implementation, and it has two distinct characteristics: its size and the sparsity of its clusters. The latter characteristic is what enabled PROCLUS to finish in a reasonable amount of time with its accuracy at an acceptable level. Sparsity let the algorithm be precise even though it used only a small number of dimensions in the Iterative phase. It is worth mentioning that the original Proclus had been tested with a dataset with N=100,000, k=5 and only l=20. In our case, we may have had fewer data points (N=70,000), but we needed to estimate l=784 dimensions and k=10 clusters.

Table 3 MNIST Handwritten Digits dataset properties

Number of Dimensions (l): 784
Number of Clusters (k): 10
Data Points (N): 70000
Data Points Per Cluster: -

With the analysis above we deduce that the implementation of Proclus with the MapReduce model gives accurate results for a variety of numeric datasets. More specifically, we accomplished an average accuracy greater than 85% even when some of the datasets had overlapping clusters.

In Table 4 we list the accuracy range for each dataset that was tested with this algorithm.


Table 4 Accuracy range for each dataset

Dataset    Accuracy Range

DIM-sets 87.5%-100%

G2-sets 85.3%-97.8%

MNIST Handwritten Digits 70.2%-93%


8. CONCLUSIONS

The goal of this thesis was to implement a serial algorithm in a parallel and distributed way using MapReduce. We parallelized the PROCLUS algorithm using the MapReduce model and, with the experimental results presented here, we showed that our distributed implementation has good accuracy and scales efficiently with large datasets. The main obstacles that we stumbled upon were, first, how to minimize the I/O cost and avoid high network cost among processing nodes, and, second, how to express the algorithm with MapReduce while keeping in mind the restrictions that this framework creates. We overcame the aforementioned difficulties by carefully crafting our solution and utilizing all the functionality the MapReduce framework has to offer.


REFERENCES

[1] C.C. Aggarwal, C. Procopiuc, J.L. Wolf, P.S. Yu, and J.S. Park. Fast algorithms for projected clustering. SIGMOD Rec., 28(2):61-72, 1999.

[2] T. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, Vol. 38, pp. 293-306, 1985.

[3] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. OSDI, 2004.

[4] R.L.F. Cordeiro, C. Traina Jr, A.J.M. Traina, J. López, U. Kang, and C. Faloutsos. Clustering very large multi-dimensional datasets with MapReduce. Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 21-24, 2011.

[5] Domo. Data never sleeps 6.0: How much data is generated every minute? 2018.

[6] A.K. Jain, M.N. Murty, and P.J. Flynn. Data clustering: a review. ACM Computing Surveys (CSUR), 1999.

[7] M. Kamber and J. Han. Data Mining: Concepts and Techniques, chapter 8, pages 335-393. Morgan Kaufmann Publishers, 2001.

[8] R.E. Bellman. Adaptive Control Processes. Princeton University Press, Princeton, NJ, 1961.

[9] G.F. Hughes. On the mean accuracy of statistical pattern recognizers. IEEE Transactions on Information Theory, 14(1):55-63, January 1968.

[10] G. Gan, C. Ma, and J. Wu. Data Clustering: Theory, Algorithms, and Applications, pages 243-249, 2007.

[11] S. Papadimitriou and J. Sun. DisCo: Distributed co-clustering with Map-Reduce. IBM T.J. Watson Research Center, Hawthorne, NY, USA, 2008.

[12] School of Computing, University of Eastern Finland. Clustering basic benchmark, http://cs.joensuu.fi/sipu/datasets/, last visited 15/06/2019.

[13] https://github.com/PetraSt/Proclus-with-Map-Reduce, last visited 15/06/2019.

[14] Prof. Marek Gagolewski, Warsaw University of Technology. Clustering Benchmark Data, http://www.gagolewski.com/resources/data/clustering/, 2009, last visited 15/06/2019.

[15] https://www.docker.com/resources/what-container, last visited 07/07/2019.

[16] https://hub.docker.com/r/harisekhon/hadoop/, last visited 07/07/2019.
