THE HYPERBOLIC SMOOTHING SOFT CLUSTERING METHOD

Adilson Elias Xavier

Programa de Engenharia de Sistemas e Computação/COPPE - UFRJ
Av. Horácio de Macedo, 2030 - CT - Bloco H - 21941-914
[email protected]

Luiz Carlos Ferreira de Sousa

PETROBRAS
Av. Almirante Barroso, 81 - Centro - Rio de Janeiro - RJ - Brasil - 20031-004
[email protected]

ABSTRACT

A new soft clustering method is derived from the hyperbolic smoothing hard clustering method, originally designed for the minimum sum-of-squares clustering problem. The mathematical modeling of this problem leads to a min-sum-min formulation which, in addition to its intrinsic bi-level nature, has the significant characteristic of being strongly nondifferentiable. The proposed method adopts a smoothing strategy using a particular function of class C^∞.

KEYWORDS: Clustering, Nondifferentiable Optimization, Nonlinear Programming, Mathematical Programming.


1. Introduction

Cluster analysis deals with the classification problem of a set of patterns or observations, represented as points in a multidimensional space, into clusters. The process follows two basic and simultaneous objectives: patterns in the same cluster must be similar (homogeneity objective) and different from patterns which belong to other clusters (separation objective).

Clustering is an important problem that appears in the broadest spectrum of applications, whose intrinsic characteristics engender many approaches, see Jain and Dubes (1988) and Hansen and Jaumard (1997).

Hard clustering algorithms assign patterns to a single cluster whereas soft clustering algorithms assign a given pattern to all clusters with a certain probability of membership. In this paper, a soft clustering methodology is developed inside a hard clustering framework.

A particular hard clustering problem formulation is considered. Among many criteria used in cluster analysis, the most natural, intuitive and frequently adopted criterion is the minimum sum-of-squares clustering (MSSC). This criterion corresponds to the minimization of the sum of squared distances from all observations to their respective cluster mean. It is a criterion for both the homogeneity and the separation objectives, as, according to the Huygens Theorem, minimizing the within-cluster inertia of a partition (homogeneity within the cluster) is equivalent to maximizing the between-cluster inertia (separation between clusters).
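For reference, the Huygens decomposition invoked above can be written out as follows. The symbols C_i (the set of observations assigned to cluster i), x_i (the mean of C_i) and \bar{s} (the mean of all m observations s_j) are notation assumed here only for illustration.

```latex
\sum_{j=1}^{m} \| s_j - \bar{s} \|_2^2
  \;=\; \sum_{i=1}^{q} \sum_{s_j \in C_i} \| s_j - x_i \|_2^2
  \;+\; \sum_{i=1}^{q} |C_i| \, \| x_i - \bar{s} \|_2^2
```

The left-hand side (total inertia) does not depend on the partition, so any decrease of the first term on the right (within-cluster inertia) is matched by an equal increase of the second (between-cluster inertia).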

The MSSC formulation produces a global optimization problem, both nondifferentiable and nonconvex, with a large number of local minimizers. It is one of the problems in the NP-hard class, see Brucker (1978).

The focus of this paper is first to present the smoothing of the min-sum-min problem engendered by the modeling of the MSSC problem. In a sense, the process we employ is an extension of a smoothing scheme, called Hyperbolic Smoothing. This technique was developed through an adaptation of the hyperbolic penalty method originally introduced by Xavier (1982), see Xavier (2001).

By smoothing, we fundamentally mean the substitution of an intrinsically nondifferentiable two-level problem by a C^∞ unconstrained differentiable single-level alternative. In this clustering application, the differentiability property is obtained by performing a set of transformations to the original problem, making possible the application of the hyperbolic smoothing scheme with the use of three different auxiliary parameters. The solution of the hard clustering formulation is achieved through the solution of a sequence of differentiable subproblems which gradually approaches the original problem.

Additionally, the paper presents the hyperbolic smoothing soft clustering method. By using any intermediary solution of the sequence of subproblems, it is possible to derive a soft clustering within the MSSC reference. In this way, a direct connection is established between the soft clustering solutions and the hard clustering solution, where the latter is obtained when the three parameters go to zero.

Although this paper considers the particular MSSC clustering problem, it must be emphasized that the proposed methodology, Hyperbolic Smoothing, can be used for solving other clustering problem formulations as well.

This paper is organized in the following way. A definition of the MSSC problem is presented in the next section. The transformation of this problem directly connected to the application of the hyperbolic smoothing approach is explained in Section 3. The smoothing procedures and the algorithm are described in Section 4. The derived soft clustering procedure is presented in Section 5. Illustrative computational results are presented in Section 6. Brief conclusions are drawn in Section 7.


2. The Hard Clustering Problem as a Min-Sum-Min Problem

Let S = {s_1, ..., s_m} denote a set of m patterns or observations from a Euclidean n-dimensional space to be clustered into a given number q of disjoint clusters. In order to formulate the original clustering problem as a min-sum-min problem, we proceed as follows. Let x_i, i = 1, ..., q, be the centroids of the clusters, where each x_i ∈ ℜⁿ. The set of these centroid coordinates will be represented by X ∈ ℜ^nq. Given a point s_j of S, we initially calculate the distance from this point to the nearest center in X. This is given by

$$ z_j = \min_{x_i \in X} \| s_j - x_i \|_2 . $$

The most frequent measurement of a clustering quality associated to a specific position of q centroids is provided by the sum of these squared distances,

$$ D(X) = \sum_{j=1}^{m} z_j^2 . $$

The optimal placing of the centroids, following the minimum sum of squares criterion, must provide the best quality of this measurement, so defining the following problem:

$$ \underset{X \in \Re^{nq}}{\text{minimize}} \; \sum_{j=1}^{m} \min_{x_i \in X} \| s_j - x_i \|_2^2 , \tag{1} $$

where X is the set of all placements of the q centroids.
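As a minimal sketch of how the objective of problem (1) is evaluated for a fixed placement of centroids, the following pure-Python fragment may help. The function names and the toy data are hypothetical and serve only as an illustration.

```python
from typing import Sequence

Point = Sequence[float]

def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def mssc_objective(points: Sequence[Point], centroids: Sequence[Point]) -> float:
    """Objective of problem (1): for each observation take the squared
    distance to its nearest centroid, then sum over all observations."""
    return sum(min(squared_distance(s, x) for x in centroids) for s in points)

# Hypothetical toy data: four points on a line, two candidate placements.
points = [(0.0,), (1.0,), (9.0,), (10.0,)]
good = mssc_objective(points, [(0.5,), (9.5,)])  # one centroid per group
bad = mssc_objective(points, [(0.0,), (1.0,)])   # both centroids in one group
print(good, bad)
```

The inner `min` is what makes the objective nondifferentiable: the nearest centroid of an observation can change abruptly as the centroids move.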

3. Transforming the Problem

The problem (1) above can be formulated equivalently as

$$ \text{minimize} \; \sum_{j=1}^{m} z_j^2 \quad \text{subject to} \quad z_j = \min_{i=1,\ldots,q} \| s_j - x_i \|_2 , \quad j = 1, \ldots, m. \tag{2} $$

Considering its definition, each z_j must necessarily satisfy the following set of inequalities: z_j − ‖s_j − x_i‖_2 ≤ 0, i = 1, ..., q. Putting these inequalities in the place of the equality constraints of problem (2), the following relaxed problem is obtained:

$$ \text{minimize} \; \sum_{j=1}^{m} z_j^2 \quad \text{subject to} \quad z_j - \| s_j - x_i \|_2 \le 0 , \quad j = 1, \ldots, m, \quad i = 1, \ldots, q. \tag{3} $$

Since the variables z_j are not bounded from below, the optimum solution of the relaxed problem will be z_j = 0, j = 1, ..., m. To obtain a bounded formulation, we first let ϕ(y) denote max{0, y} and then observe that, from the set of constraints in (3), it follows that

$$ \sum_{i=1}^{q} \phi\left( z_j - \| s_j - x_i \|_2 \right) = 0 , \quad j = 1, \ldots, m. \tag{4} $$

For fixed j and assuming d_1 < ... < d_q with d_i = ‖s_j − x_i‖_2, Figure 1 illustrates the first three summands of (4) as a function of z_j.

Figure 1: Summands in (4)

Using (4) in place of the set of inequality constraints in (3), we would obtain an equivalent problem maintaining the undesirable property that z_j, j = 1, ..., m, still has no lower bound. Considering, however, that the objective function of problem (3) will force each z_j, j = 1, ..., m, downward, we can think of bounding the latter variables from below by considering ">" in place of "=" in (4). Therefore, we obtain the following “non-canonical” problem:

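$$ \text{minimize} \; \sum_{j=1}^{m} z_j^2 \quad \text{subject to} \quad \sum_{i=1}^{q} \phi\left( z_j - \| s_j - x_i \|_2 \right) > 0 , \quad j = 1, \ldots, m. \tag{5} $$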

The canonical formulation can be recovered from (5) by perturbing (4) and considering the modified problem:

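$$ \text{minimize} \; \sum_{j=1}^{m} z_j^2 \quad \text{subject to} \quad \sum_{i=1}^{q} \phi\left( z_j - \| s_j - x_i \|_2 \right) \ge \varepsilon , \quad j = 1, \ldots, m, \tag{6} $$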

for ε > 0. Since the feasible set of problem (5) is the limit of that of (6) when ε → 0⁺, we can then consider solving (5) by solving a sequence of problems like (6) for a sequence of decreasing values of ε that approaches 0.

4. Smoothing the Problem

Analyzing the problem (6), the definition of function ϕ endows it with an extremely rigid nondifferentiable structure, which makes its computational solution very hard. In view of this, the numerical method which we adopt for solving problem (1) takes a smoothing approach.

From this perspective, let us define the function:

$$ \varphi(y, \tau) = \frac{y + \sqrt{y^2 + \tau^2}}{2} \tag{7} $$

for y ∈ ℜ and τ > 0.

Function φ has the following properties:

(a) φ(y, τ) > ϕ(y), ∀τ > 0;

(b) lim_{τ→0} φ(y, τ) = ϕ(y);

(c) φ(·, τ) is an increasing convex C^∞ function.

Therefore, function φ constitutes a smooth approximation of function ϕ. Adopting the same assumptions used in Figure 1, the first three summands of (4) and their corresponding smoothed approximations, given by (7), are depicted in Figure 2.

Figure 2: Original and smoothed summands in (4)
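Properties (a) and (b) of function φ can be checked numerically with a few lines of Python. This is only an illustrative sketch: the function names, the sample points and the values of τ are assumptions made for the example.

```python
import math

def phi_max(y: float) -> float:
    """Nonsmooth function max{0, y}."""
    return max(0.0, y)

def phi_smooth(y: float, tau: float) -> float:
    """Hyperbolic smoothing function (7): (y + sqrt(y^2 + tau^2)) / 2."""
    return (y + math.sqrt(y * y + tau * tau)) / 2.0

# Hypothetical sample points and smoothing parameters.
ys = [-2.0, -0.5, 0.0, 0.5, 2.0]
largest_gaps = []
for tau in (1.0, 0.1, 0.001):
    gaps = [phi_smooth(y, tau) - phi_max(y) for y in ys]
    # Property (a): the smooth function lies strictly above max{0, y}.
    assert all(g > 0.0 for g in gaps)
    largest_gaps.append(max(gaps))
    print(f"tau = {tau}: largest gap = {max(gaps):.6f}")
```

On these sample points the largest gap occurs at y = 0, where it equals τ/2, so the approximation tightens as τ decreases, in line with property (b).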

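By using function φ in the place of function ϕ, the following problem

$$ \text{minimize} \; \sum_{j=1}^{m} z_j^2 \quad \text{subject to} \quad \sum_{i=1}^{q} \varphi\left( z_j - \| s_j - x_i \|_2 , \tau \right) \ge \varepsilon , \quad j = 1, \ldots, m, \tag{8} $$

is produced.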

In order to obtain a differentiable problem, it is yet necessary to smooth the Euclidean distance ‖s_j − x_i‖_2. For this purpose, let us define the function

$$ \theta(s_j, x_i, \gamma) = \sqrt{ \sum_{l=1}^{n} (s_{jl} - x_{il})^2 + \gamma^2 } \tag{9} $$

for γ > 0.

Function θ has the following properties:

(a) lim_{γ→0} θ(s_j, x_i, γ) = ‖s_j − x_i‖_2;

(b) θ is a C^∞ function.
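The smoothed distance (9) is simple to evaluate. The sketch below, with hypothetical names and sample values, compares it with the plain Euclidean distance for decreasing γ.

```python
import math
from typing import Sequence

def theta(s: Sequence[float], x: Sequence[float], gamma: float) -> float:
    """Smoothed Euclidean distance (9): sqrt(sum_l (s_l - x_l)^2 + gamma^2)."""
    return math.sqrt(sum((sl - xl) ** 2 for sl, xl in zip(s, x)) + gamma * gamma)

s = (3.0, 4.0)            # hypothetical observation
x = (0.0, 0.0)            # hypothetical centroid
exact = math.dist(s, x)   # plain Euclidean distance (3-4-5 triangle)
for gamma in (1.0, 0.1, 0.001):
    # Property (a): the difference shrinks as gamma decreases.
    print(gamma, theta(s, x, gamma) - exact)

# When the centroid coincides with the observation the plain norm has a
# kink, while the smoothed distance simply takes the value gamma there.
at_kink = theta(s, s, 0.25)
```

The last line illustrates why the smoothing is needed: the Euclidean norm is not differentiable at x_i = s_j, whereas θ remains smooth for any γ > 0.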

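By using function θ in place of the Euclidean distance, the following completely differentiable problem is obtained:

$$ \text{minimize} \; \sum_{j=1}^{m} z_j^2 \quad \text{subject to} \quad \sum_{i=1}^{q} \varphi\left( z_j - \theta(s_j, x_i, \gamma), \tau \right) \ge \varepsilon , \quad j = 1, \ldots, m. \tag{10} $$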

Therefore, the properties of functions φ and θ allow us to seek a solution for the problem (6) by solving a sequence of subproblems like the problem (10), produced by the decreasing of the parameters γ →0, τ →0 and ε →0.

On the other hand, given any set of centroids x_i, i = 1, ..., q, due to property (c) of the hyperbolic smoothing function φ, the constraints of problem (10) will certainly be active. So, problem (10) is ultimately equivalent to the problem:

$$ \text{minimize} \; \sum_{j=1}^{m} z_j^2 \quad \text{subject to} \quad \sum_{i=1}^{q} \varphi\left( z_j - \theta(s_j, x_i, \gamma), \tau \right) = \varepsilon , \quad j = 1, \ldots, m. \tag{11} $$

The dimension of the variable domain space of problem (11) is (nq + m). As, in general, the value of the parameter m, the cardinality of the set S of the observations s_j, is large, problem (11) has a large number of variables. However, it has an intrinsically separable structure, because each variable z_j appears in only one equality constraint. Therefore, as the partial derivative of h(x, z_j), defined in (13), with respect to z_j, j = 1, ..., m, is not equal to zero, it is possible to use the Implicit Function Theorem to calculate each component z_j, j = 1, ..., m, as a function of the centroid variables x_i, i = 1, ..., q. In this way, the unconstrained problem

$$ \text{minimize} \; f(x) = \sum_{j=1}^{m} z_j(x)^2 \tag{12} $$

is obtained, where each z_j(x) results from the calculation of a zero of each equation

$$ h(x, z_j) = \sum_{i=1}^{q} \varphi\left( z_j - \theta(s_j, x_i, \gamma), \tau \right) - \varepsilon = 0 , \quad j = 1, \ldots, m. \tag{13} $$

Due to property (c) of the hyperbolic smoothing function, each term φ above is strictly increasing with variable z_j and therefore the equation has a single zero.
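Because h is strictly increasing in z_j, its single zero can be located by any bracketing root finder. The sketch below uses plain bisection, which is an assumption of this example and not necessarily the root finder used by the authors. The bracket follows from φ(y, τ) > max{0, y} and from the bound φ(y, τ) < τ²/(4|y|) for y < 0. All names and sample values are hypothetical.

```python
import math
from typing import Sequence

def phi(y: float, tau: float) -> float:
    """Hyperbolic smoothing (7), in a cancellation-free form for y < 0."""
    r = math.sqrt(y * y + tau * tau)
    return (y + r) / 2.0 if y >= 0.0 else tau * tau / (2.0 * (r - y))

def theta(s: Sequence[float], x: Sequence[float], gamma: float) -> float:
    """Smoothed Euclidean distance (9)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, x)) + gamma * gamma)

def h(z: float, d: Sequence[float], tau: float, eps: float) -> float:
    """Left-hand side of (13) for one observation with smoothed distances d."""
    return sum(phi(z - di, tau) for di in d) - eps

def solve_z(s, centroids, gamma: float, tau: float, eps: float) -> float:
    """Return the unique zero z_j of h(x, z_j) by bisection."""
    d = [theta(s, x, gamma) for x in centroids]
    hi = min(d) + eps                                      # h(hi) > 0
    lo = min(d) - len(d) * tau * tau / (4.0 * eps) - 1.0   # h(lo) < 0
    for _ in range(200):                                   # far past double precision
        mid = 0.5 * (lo + hi)
        if h(mid, d, tau, eps) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# Hypothetical data: one observation and two centroids in the plane.
s = (0.0, 0.0)
centroids = [(1.0, 0.0), (3.0, 0.0)]
z = solve_z(s, centroids, gamma=0.01, tau=0.01, eps=0.01)
print(z)  # close to the distance to the nearest centroid, which is 1
```

For small γ, τ and ε the computed z_j stays close to the distance from s_j to its nearest centroid, which is the quantity the smoothed problem is designed to approximate.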

Again, due to the Implicit Function Theorem, the functions z_j(x) have all derivatives with respect to the variables x_i, i = 1, ..., q, and therefore it is possible to calculate the gradient of the objective function of problem (12),

$$ \nabla f(x) = \sum_{j=1}^{m} 2 \, z_j(x) \, \nabla z_j(x) \tag{14} $$

where

$$ \nabla z_j(x) = - \nabla h(x, z_j) \Big/ \frac{\partial h(x, z_j)}{\partial z_j} . \tag{15} $$

The above approach employs the same basic idea as Abadie and Carpentier (1969) for the development of the general reduced gradient algorithm, intended for the solution of the general nonlinear programming problem subject to equality constraints.

In this way, it is easy to solve problem (12) by making use of any method based on first order derivative information. Finally, it must be emphasized that problem (12) is defined on an (nq)-dimensional space, so it is a small problem, since the number of clusters q is, in general, very small for real applications.
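The derivatives entering (14) and (15) follow directly from definitions (7), (9) and (13). They are written out here for reference; the abbreviations θ_ij and φ′ are introduced only for brevity and are not part of the original notation.

```latex
\theta_{ij} = \theta(s_j, x_i, \gamma), \qquad
\varphi'(y, \tau) = \frac{\partial \varphi}{\partial y}(y, \tau)
  = \frac{1}{2} \left( 1 + \frac{y}{\sqrt{y^2 + \tau^2}} \right) > 0

\frac{\partial h(x, z_j)}{\partial z_j}
  = \sum_{i=1}^{q} \varphi'(z_j - \theta_{ij}, \tau), \qquad
\nabla_{x_i} h(x, z_j)
  = - \varphi'(z_j - \theta_{ij}, \tau) \, \frac{x_i - s_j}{\theta_{ij}}

\nabla_{x_i} z_j(x)
  = \frac{\varphi'(z_j - \theta_{ij}, \tau)}
         {\sum_{k=1}^{q} \varphi'(z_j - \theta_{kj}, \tau)}
    \, \frac{x_i - s_j}{\theta_{ij}}
```

The scalar factors multiplying (x_i − s_j)/θ_ij are positive and sum to one over i, and the denominator is strictly positive, which is what makes the use of the Implicit Function Theorem legitimate.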

The solution of the original clustering problem can be obtained by using the Hyperbolic Smoothing Clustering Algorithm, described below in a simplified form.

Simplified Hard HSC Algorithm

Initialization Step: Choose values 0 < ρ₁ < 1, 0 < ρ₂ < 1, 0 < ρ₃ < 1; let k = 1 and choose initial values: x⁰, γ¹, τ¹, ε¹.

Main Step: Repeat until a stop rule is attained

Solve the problem (12) with γ = γ^k, τ = τ^k and ε = ε^k, starting at the initial point x^(k−1), and let x^k be the solution obtained.

Let γ^(k+1) = ρ₁ γ^k, τ^(k+1) = ρ₂ τ^k, ε^(k+1) = ρ₃ ε^k, k = k + 1. □

Just as in other smoothing methods, the solution to the clustering problem is obtained, in theory, by solving an infinite sequence of optimization problems. In the HSC algorithm, each problem that is minimized is unconstrained and of low dimension.

Notice that the algorithm causes γ and τ to approach 0, so the constraints of the subproblems it solves, given as in (11), tend to those of (6). In addition, the algorithm causes ε to approach 0 so, in a simultaneous movement, the solved problem (6) gradually approaches problem (5).
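The simplified algorithm can be sketched in pure Python as follows. Several choices here are assumptions made for the sake of a short, self-contained example and are not taken from the paper: the inner solver is plain gradient descent with backtracking (the experiments reported in this paper use a Quasi-Newton BFGS routine), the zero of (13) is found by bisection, the stop rule is a fixed number of outer iterations, and the data, starting point and parameter values are hypothetical.

```python
import math

def phi(y, tau):
    """Hyperbolic smoothing (7), in a cancellation-free form for y < 0."""
    r = math.sqrt(y * y + tau * tau)
    return (y + r) / 2.0 if y >= 0.0 else tau * tau / (2.0 * (r - y))

def dphi(y, tau):
    """Derivative of phi with respect to y (always strictly positive)."""
    r = math.sqrt(y * y + tau * tau)
    return 0.5 * (1.0 + y / r) if y >= 0.0 else 0.5 * tau * tau / (r * (r - y))

def theta(s, x, gamma):
    """Smoothed Euclidean distance (9)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, x)) + gamma * gamma)

def solve_z(d, tau, eps):
    """Zero of h in (13) by bisection, given the smoothed distances d."""
    lo = min(d) - len(d) * tau * tau / (4.0 * eps) - 1.0  # h(lo) < 0
    hi = min(d) + eps                                      # h(hi) > 0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if sum(phi(mid - di, tau) for di in d) > eps:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def f_and_grad(X, S, gamma, tau, eps):
    """Objective (12) and its gradient obtained from (14)-(15)."""
    f, g = 0.0, [[0.0] * len(x) for x in X]
    for s in S:
        d = [theta(s, x, gamma) for x in X]
        z = solve_z(d, tau, eps)
        w = [dphi(z - di, tau) for di in d]
        wsum = sum(w)
        f += z * z
        for i, x in enumerate(X):
            c = 2.0 * z * w[i] / (wsum * d[i])
            for l in range(len(x)):
                g[i][l] += c * (x[l] - s[l])
    return f, g

def minimize(X, S, gamma, tau, eps, iters=50):
    """Stand-in inner solver: gradient descent with Armijo backtracking."""
    f, g = f_and_grad(X, S, gamma, tau, eps)
    for _ in range(iters):
        g2 = sum(v * v for row in g for v in row)
        if g2 < 1e-12:
            break
        step = 1.0
        while step > 1e-12:
            Xn = [[v - step * gv for v, gv in zip(x, gx)] for x, gx in zip(X, g)]
            fn, gn = f_and_grad(Xn, S, gamma, tau, eps)
            if fn <= f - 1e-4 * step * g2:
                break
            step *= 0.5
        else:
            return X, f  # no acceptable step found: keep the current point
        X, f, g = Xn, fn, gn
    return X, f

def hsc(S, X0, gamma, tau, eps, rho=(0.25, 0.25, 0.25), outer=6):
    """Outer loop of the simplified hard HSC algorithm."""
    X = [list(x) for x in X0]
    for _ in range(outer):
        X, f = minimize(X, S, gamma, tau, eps)
        gamma, tau, eps = rho[0] * gamma, rho[1] * tau, rho[2] * eps
    return X, f

# Hypothetical toy instance: two well-separated squares of four points each.
S = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
X, f = hsc(S, [(2.0, 3.0), (8.0, 7.0)], gamma=1.0, tau=1.0, eps=1.0)
print([[round(v, 3) for v in x] for x in X], round(f, 3))
```

On this toy instance the two centroids should end up close to the centres of the two squares, (0.5, 0.5) and (10.5, 10.5), with an objective value slightly above the MSSC value of 4 because the last subproblem is still solved with small positive parameters.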


5. The Soft Clustering Procedure

Soft clustering can be produced in a natural way by solving problem (12) with any fixed set of positive values for parameters γ, τ and ε. Let p_ij be the probability of observation j belonging to cluster i. So, it must be observed that p_ij ≥ 0 and

$$ \sum_{i=1}^{q} p_{ij} = 1 , $$

for each j = 1, ..., m. Given any set of parameters γ, τ and ε, let x^* be the solution obtained by solving problem (12). Let us define

$$ v_{ij} = \varphi\left( z_j(x^*) - \theta(s_j, x_i^*, \gamma), \tau \right) . $$

From each constraint (13) of problem (12), we derive the equation:

$$ \sum_{i=1}^{q} v_{ij} = \varepsilon . $$

Now, by making

$$ p_{ij} = v_{ij} / \varepsilon , $$

a soft clustering procedure is trivially produced within the MSSC point of view.

Moreover, let us define a monotonic increasing function f: ℜ → ℜ, with f(0) = 0 and f(1) = 1. So, it is possible to establish a more general soft clustering procedure by defining the alternative probability measure:

$$ p_{ij} = f(v_{ij} / \varepsilon) \Big/ \sum_{k=1}^{q} f(v_{kj} / \varepsilon) , \quad j = 1, \ldots, m. \tag{16} $$
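The basic rule p_ij = v_ij/ε can be illustrated with a short sketch. The two centroids below are hypothetical stand-ins for a solution x^* of problem (12), the zero of (13) is again located by bisection (an assumption of this example), and all names and parameter values are illustrative only.

```python
import math

def phi(y, tau):
    """Hyperbolic smoothing (7), in a cancellation-free form for y < 0."""
    r = math.sqrt(y * y + tau * tau)
    return (y + r) / 2.0 if y >= 0.0 else tau * tau / (2.0 * (r - y))

def theta(s, x, gamma):
    """Smoothed Euclidean distance (9)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, x)) + gamma * gamma)

def memberships(s, centroids, gamma, tau, eps):
    """Soft memberships p_ij = v_ij / eps of a single observation s."""
    d = [theta(s, x, gamma) for x in centroids]
    lo = min(d) - len(d) * tau * tau / (4.0 * eps) - 1.0
    hi = min(d) + eps
    for _ in range(80):  # bisection on h(x, z) = 0, equation (13)
        mid = 0.5 * (lo + hi)
        if sum(phi(mid - di, tau) for di in d) > eps:
            hi = mid
        else:
            lo = mid
    z = 0.5 * (lo + hi)
    return [phi(z - di, tau) / eps for di in d]

centroids = [(0.0, 0.0), (4.0, 0.0)]  # hypothetical, standing in for x*
p_soft = memberships((0.5, 0.0), centroids, 1.0, 1.0, 1.0)
p_sharp = memberships((0.5, 0.0), centroids, 0.05, 0.05, 0.05)
p_mid = memberships((2.0, 0.0), centroids, 0.05, 0.05, 0.05)
print(p_soft, p_sharp, p_mid)
```

With the larger parameters the observation near the first centroid keeps a visible share in the second cluster, with the smaller ones its membership is almost entirely in the first cluster, and an observation equidistant from both centroids is split evenly.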

6. Computational Results

The numerical experiments have been carried out on a Pentium IV, 1GHz PC. The programs are coded with Compaq Visual FORTRAN, Version 6.1. The unconstrained minimization tasks were carried out by means of a Quasi-Newton algorithm employing the BFGS updating formula, obtained from the Harwell Library (http://www.cse.scitech.ac.uk/nag/hsl/). The results obtained illustrate both the reliability and the efficiency of the proposed method.

In order to further highlight the performance of the proposed hard HSC algorithm, results obtained by solving one large standard test problem from the cluster analysis literature are shown below. The problem is TSPLIB-3038, which uses 3038 points in the plane from a traveling salesman problem presented by Reinelt (1991).

Table 1 contains the number of clusters (q), the putative global minimum (f_opt), taken from du Merle et al. (2000), Bagirov and Yearwood (2006) or Bagirov (2008), the best solution produced by the HSC Algorithm (f_HSC), the relative error (E_Best) for this solution, the best solution frequency, the number of initial start points used by the HSC Algorithm, the mean relative error (E_Mean) calculated with all solutions and the CPU time given in seconds, where the relative error is calculated as

$$ E = 100 \, \frac{f_{HSC} - f_{opt}}{f_{opt}} . $$
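As a small worked check of the relative error formula, the sketch below recomputes E_Best for two rows of Table 1 (q = 2 and q = 60). The function name is hypothetical.

```python
def relative_error_percent(f_hsc: float, f_opt: float) -> float:
    """Relative error E = 100 (f_HSC - f_opt) / f_opt, in percent."""
    return 100.0 * (f_hsc - f_opt) / f_opt

# Values taken from the rows q = 2 and q = 60 of Table 1.
e_q2 = relative_error_percent(0.31705e10, 0.31688e10)
e_q60 = relative_error_percent(0.81115e8, 0.82006e8)
print(f"{e_q2:.2f} {e_q60:.2f}")  # prints "0.05 -1.09"
```

A negative value means that the HSC solution has a lower objective value than the putative global minimum previously reported.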

Table 1: Hard Clustering Results for the TSPLIB-3038 Instance

  q  Putative Global  Best Solution   E_Best  Best Solution  Start   E_Mean  CPU Time
     Minimum          HSC Algorithm   (%)     Frequency      Points  (%)     (s)
  2  0.31688E10       0.31705E10        0.05       10          10      0.05      0.60
  3  0.21763E10       0.21776E10        0.06        7          10      1.08      1.18
  4  0.14790E10       0.14793E10        0.02       10          10      0.02      1.81
  5  0.11982E10       0.11986E10        0.03        9          10      0.06      3.09
  6  0.96918E9        0.96936E9         0.02       10          10      0.02      9.56
  7  0.83966E9        0.83967E9        0.001        7          10      4.66     10.65
  8  0.73475E9        0.73491E9         0.02        1          10      1.35     16.24
  9  0.64477E9        0.64471E9        -0.01        2          10      0.38     10.05
 10  0.56025E9        0.56030E9        +0.01       10          10      0.01     16.16
 20  0.26681E9        0.26675E9        -0.02        1          10      2.51     62.90
 30  0.17557E9        0.17543E9        -0.08        3          10      0.36    169.17
 40  0.12548E9        0.12541E9        -0.06        1          20      1.34    333.92
 50  0.98400E8        0.98893E8        +0.50        1          40      1.46    662.33
 60  0.82006E8        0.81115E8        -1.09        1          10      0.19   1049.97
 80  0.61217E8        0.61025E8        -0.31        1          10      0.63   2038.31
100  0.48912E8        0.48470E8        -0.90        1          10      0.19   3499.76

From the results presented in Table 1, we observe that the implementation of the HSC Algorithm computes a best solution which is very close to the putative global minimum, the best known solution of the TSPLIB-3038 instance produced by all algorithms in all computational experiments. Moreover, in this preliminary experiment, by using a relatively small number of initial start points, seven new putative global minimum results (q=9, q=20, q=30, q=40, q=60, q=80 and q=100) have been established, relative to those recorded in the literature. On the other hand, the low values shown in column E_Mean indicate that the HSC Algorithm computes deep local minima, meaning a good average performance.

Table 2 shows soft clustering computational results obtained by solving the classical Iris Fisher test problem, with 150 patterns with 4 components each. This data set was obtained from 50 samples of three iris flower varieties from the Gaspé Peninsula. It was originally used by Fisher (1936).

Three independent computational experiments were performed by varying the soft clustering parameters γ, τ and ε. In Table 2, only aggregate results are presented. Columns C1, C2 and C3 represent the three soft clusters produced in each one of the three instances or independent solutions.

Each row indicates the cluster association for each Iris variety. So, for each test instance, each cell ki represents the association of variety k to cluster i, calculated by

$$ \sum_{j \in J_k} p_{ij} , $$

where J_k is the set of observations belonging to variety k.

First, these results show that, when the parameters decrease, all Iris Setosa observations become perfectly separated in cluster C1 from the other two clusters. In the same way, cluster C1 contains Iris Setosa observations in increasing concentration. On the other hand, in all three instances, cluster C2 and cluster C3 contain a mixed composition, the first dominantly Iris Versicolor and the second dominantly Iris Virginica. When the soft clustering parameters γ, τ and ε decrease, C2 contains about 36 Iris Versicolor observations and 38 as a whole, while C3 contains about 48 Iris Virginica observations and 62 as a whole. These last results are consistent with those published in the literature, besides providing insight into how the soft clustering parameters can be used to shape the result.

Table 2: Soft Clustering Results for the Iris Fisher Instance

(τ, ε, γ)     (3.45, 3.45, 1.725)      (0.43, 0.43, 0.215)      (0.05, 0.05, 0.025)
Iris           C1      C2      C3       C1      C2      C3       C1      C2      C3
Setosa        44.91    1.98    3.11    49.39    0.24    0.37    49.92    0.03    0.05
Versicolor     2.18   33.82   14.00     0.27   35.81   13.92     0.03   35.97   14.00
Virginica      3.85    9.30   36.85     0.53    3.32   46.15     0.07    2.19   47.74

7. Conclusions

In view of the preliminary results, where the proposed methodology shows efficient and robust performances, we believe that this algorithm can represent a possible approach for dealing with both the hard and the soft clustering of real applications.

Moreover, the existence of three soft clustering parameters, γ, τ and ε, offers a broad possibility to construct soft clustering procedures tailored to the particular requirements of real applications.

References

Abadie, J. and Carpentier, J., (1969) “Generalization of the Wolfe Reduced Gradient Method to the Case of Nonlinear Constraints”, in “Optimization”, R. Fletcher (editor), Academic Press, London, pages 37-47.

Bagirov, A. M. and Yearwood, J. (2006) “A New Nonsmooth Optimization Algorithm for Minimum Sum-of-Squares Clustering Problems”, European Journal of Operational Research, Vol. 170 Issue 2 pp. 578-596.

Bagirov, A. M. (2008) “Modified Global k-means Algorithm for Minimum Sum-of-Squares Clustering Problems”, Pattern Recognition, doi:10.1016/j.patcog.2008.04.004.

Brucker, P. (1978) “On the Complexity of Clustering Problems”, in: M. Beckmann and H. P. Kunzi, eds., Optimization and Operations Research, Lecture Notes in Economics and Mathematical Systems, v. 157 (Springer, Heidelberg) 45-54.

Fisher, R. A. (1936) “The Use of Multiple Measurements in Taxonomic Problems”, Ann. Eugenics, VII part II pp. 179-188. Reprinted in R. A. Fisher (1950) “Contributions to Mathematical Statistics”, Wiley.

Hansen, P. and Jaumard, B. (1997) “Cluster Analysis and Mathematical Programming”, Mathematical Programming No. 79 pp. 191-215.

Jain, A. K. and Dubes, R. C. (1988) “Algorithms for Clustering Data”, Prentice-Hall Inc., Upper Saddle River, NJ.

du Merle, O.; Hansen, P.; Jaumard, B.; Mladenovic, V. (2000) “An Interior Point Algorithm for Minimum Sum-of-Squares Clustering”, SIAM Journal on Scientific Computing No. 21 pp. 1485-1505.

Reinelt, G. (1991) “TSP-LIB - A Traveling Salesman Library”, ORSA J. Comput., pp. 376-384.

Sousa, L.C.F. (2005), “Desempenho Computacional do Método de Agrupamento Via Suavização Hiperbólica”, M.Sc. thesis - COPPE - UFRJ, Rio de Janeiro.

Späth, H. (1980), “Cluster Analysis Algorithms for Data Reduction and Classification”, Ellis Horwood, Upper Saddle River, NJ.

Xavier, A. E. (1982), “Penalização Hiperbólica: Um Novo Método para Resolução de Problemas de Otimização”, Dissertação de M.Sc., COPPE/UFRJ, Rio de Janeiro, RJ.

Xavier, A. E. (2001), “Hyperbolic Penalty: a new method for nonlinear programming with inequalities”, Intl. Trans. in Op. Res., n. 8, pp. 659-671.