DOI 10.1007/s10898-014-0149-3

Column generation bounds for numerical microaggregation

Daniel Aloise · Pierre Hansen · Caroline Rocha · Éverton Santi

Received: 13 February 2013 / Accepted: 14 January 2014 / Published online: 18 February 2014 © Springer Science+Business Media New York 2014

Abstract The biggest challenge when disclosing private data is to share information contained in databases while protecting people from being individually identified. Microaggregation is a family of methods for statistical disclosure control. The principle of microaggregation is that confidentiality rules permit the publication of individual records if they are partitioned into groups of size larger than or equal to a fixed threshold value, where none is more representative than the others in the same group. The application of such rules leads to replacing individual values by those computed from small groups (microaggregates) before data publication. This work proposes a column generation algorithm for numerical microaggregation whose pricing problem is solved by a specialized branch-and-bound. The algorithm is able to find, for the first time, lower bounds for instances of three real-world datasets commonly used in the literature. Furthermore, new best known solutions are obtained for these instances by means of a simple heuristic method using the columns generated.

Keywords Microaggregation · Column generation · Cuts · Branch-and-bound

1 Introduction

The objective in the discipline of statistical disclosure control (SDC) is to allow data to be mined while securing private information [35,36]. Indeed, the best form of data protection is encryption. However, encrypted data has no utility for data mining techniques. SDC acts on the tradeoff between the maximum utility of using raw data and the maximum protection provided by encryption. SDC techniques often result in data being modified before being made publicly available.

D. Aloise (✉) · C. Rocha · É. Santi
Universidade Federal do Rio Grande do Norte, Campus Universitário s/n, Natal, RN 59072-970, Brazil
e-mail: daniel.aloise@gerad.ca
C. Rocha
e-mail: caroline.rocha@ect.ufrn.br
É. Santi
e-mail: santi.everton@gmail.com
P. Hansen
GERAD and HEC Montréal, 3000, Chemin de la Côte-Sainte-Catherine, Montreal, QC H3T 2A7, Canada
e-mail: pierre.hansen@gerad.ca

Microaggregation is a class of perturbative SDC methods for microdata (individual records) which has been extensively studied recently (e.g., [20,27,28,34]). The principle of microaggregation is that individual records are replaced by values computed from small homogeneous groups (microaggregates) before data publication. Since the protected dataset contains only the masked data, its disclosure is less likely to violate individual privacy.

Domingo-Ferrer and Mateo-Sanz defined in [7] a mathematical programming model for microaggregation over numerical microdata. This paper is now cited more than 350 times according to Google Scholar. In their model, microdata of n individual records, represented by points p_i = (p_i^r, r = 1,…,s) in R^s for i = 1,…,n, are partitioned into clusters of size larger than or equal to a parameter g by minimizing the sum of squared Euclidean distances from each point to the center of the cluster to which it belongs. The model is mathematically expressed as follows:

$$
\begin{aligned}
SSE = \min_{x,y,k}\ & \sum_{i=1}^{n}\sum_{j=1}^{k} x_{ij}\,\|p_i - y_j\|^2 \\
\text{subject to}\ & \sum_{j=1}^{k} x_{ij} = 1, && \forall i = 1,\ldots,n \\
& \sum_{i=1}^{n} x_{ij} \ge g, && \forall j = 1,\ldots,k \\
& x_{ij} \in \{0,1\}, && \forall i = 1,\ldots,n;\ \forall j = 1,\ldots,k \\
& y_j \in \mathbb{R}^s, && \forall j = 1,\ldots,k.
\end{aligned}
\tag{1}
$$

The binary decision variables x_ij express the assignment of point p_i to the cluster j centered at variable y_j ∈ R^s, for j = 1,…,k, where k is also a variable of the model. The norm ‖·‖ denotes the Euclidean distance between the two points in its argument in the s-dimensional space under consideration. The first set of constraints assures that every point p_i, i = 1,…,n, is assigned to a cluster. The second set of constraints requires that the size of each cluster be greater than or equal to g; otherwise, the trivial optimal solution would consist of singleton clusters with k equal to n.

For a fixed x, first-order conditions on the gradient of the objective function require that at an optimal solution

$$
\sum_{i=1}^{n} x_{ij}\,(y_j^r - p_i^r) = 0 \quad \forall j, r, \qquad \text{i.e.,} \qquad y_j^r = \frac{\sum_{i=1}^{n} x_{ij}\,p_i^r}{\sum_{i=1}^{n} x_{ij}} \quad \forall j, r.
\tag{2}
$$

Hence, the optimal y variables are always at the centroids of the clusters. In microaggregation, each microdata record is masked to the coordinates of its centroid before publication. Remark that when parameter g = 0, points and centroids coincide in the trivial optimal solution; consequently, the data is completely unprotected. Data protection is likely to be higher as larger values of g are used in (1), since each microdata record is masked to the centroid of a larger cluster. Naturally, this masking process comes with some information loss which may spoil the data for future use. It is up to the data owner to decide the appropriate masking level for his data.
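To make (1) and (2) concrete, the following C++ sketch (C++ being the language in which the authors report implementing their algorithms; the function and variable names here are ours) evaluates the SSE objective for a fixed assignment by placing each centroid at the mean of its cluster, as required by (2):

```cpp
#include <cstddef>
#include <vector>

using Point = std::vector<double>;

// SSE of model (1) for a fixed assignment: each centroid y_j is placed at
// the mean of its cluster, which is optimal for given x by Eq. (2).
double sse(const std::vector<Point>& p,      // n points in R^s
           const std::vector<int>& cluster,  // cluster[i] in {0,...,k-1}
           int k) {
    const std::size_t s = p[0].size();
    std::vector<Point> y(k, Point(s, 0.0));  // centroids y_j
    std::vector<int> size(k, 0);
    for (std::size_t i = 0; i < p.size(); ++i) {  // per-cluster coordinate sums
        ++size[cluster[i]];
        for (std::size_t r = 0; r < s; ++r) y[cluster[i]][r] += p[i][r];
    }
    for (int j = 0; j < k; ++j)                   // divide: y_j = centroid (Eq. 2)
        for (std::size_t r = 0; r < s; ++r) y[j][r] /= size[j];
    double total = 0.0;                           // sum of squared distances
    for (std::size_t i = 0; i < p.size(); ++i)
        for (std::size_t r = 0; r < s; ++r) {
            const double d = p[i][r] - y[cluster[i]][r];
            total += d * d;
        }
    return total;
}
```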

Problem (1) can be solved in time O(g²n) for s = 1 with an algorithm proposed by Hansen and Mukherjee in [17]. However, the problem is NP-hard in general dimension [26]. Consequently, many heuristics have been proposed in the literature to approach the problem for multivariate instances. They include the maximum distance to average vector (MDAV) method proposed by Domingo-Ferrer and Torra [8] and its variant V-MDAV [32], the centroid-based fixed-size (CBFS) algorithm of Laszlo and Mukherjee [22], the density-based algorithm (DBA) of Lin et al. [24], the two fixed reference points (TFRP) method of Chang et al. [6], the μ-approx heuristic of Domingo-Ferrer et al. [10] (a similar heuristic is presented by Aggarwal et al. [1]), and the successive group selection algorithm of Panagiotakis and Tziritas [27].

It is customary to compare heuristics for NP-hard problems by measuring gaps between the heuristic solutions and the optimal values obtained by worst-case exponential methods (e.g., branch-and-bound). If optimal solutions are not available, which is often the case for instances that are more than toy examples (a toy microaggregation instance with n = 11 and s = 2 is given in [9], where it was solved exactly by enumeration), the gaps are calculated by using lower bounds obtained from less expensive procedures. The heuristics above share the fact that their performance in terms of solution quality is usually compared only with respect to the upper bound solutions obtained for three real datasets from [7]: (i) Tarragona with n = 834 and s = 13, (ii) Census with n = 1,080 and s = 13, and (iii) Eia with n = 4,092 and s = 11; for g = 3, 5 and 10. Some authors report experiments on larger datasets with n = 100,000 points (e.g., [6,31,33]) in order to measure the scalability of their algorithms.

In this paper, a column generation (CG) algorithm is proposed to provide lower bounds for model (1). Consequently, it will be possible to evaluate more precisely how far heuristic solutions are from the true optimal values. To the best of our knowledge, this is the first time that lower bounds are provided for microaggregation of multivariate data. Moreover, a simple heuristic procedure is derived from the lower bounding method, consisting of solving a binary integer program with the columns generated. The rest of the paper is organized as follows. The CG formulation is described in Sect. 2. Section 3 describes the CG pricing problem and its mathematical properties, exploited in the specialized branch-and-bound method of Sect. 4. Section 5 presents a set of cutting planes which can be used to enhance the lower bound obtained by the CG algorithm. The structure of the whole method to compute lower bounds is presented in Sect. 6. Computational experiments on benchmark datasets are reported in Sect. 7. Finally, conclusions are given in Sect. 8.

2 A column generation formulation

Problem (1) corresponds to a partitioning problem in which the number of parts k is to be determined. Let us consider any cluster C_t, with |C_t| ≥ g, for which

$$
a_{it} = \begin{cases} 1 & \text{if point } p_i \text{ belongs to cluster } C_t \\ 0 & \text{otherwise,} \end{cases}
$$

and let us denote by y_t the centroid of the points p_i such that a_{it} = 1. Thus, the cost c_t of cluster C_t can be written as

$$
c_t = \sum_{i=1}^{n} a_{it}\,\|p_i - y_t\|^2.
$$

An alternative formulation for the microaggregation problem (1) is then given by

$$
\begin{aligned}
\min_{z}\ & \sum_{t \in T} c_t z_t \\
\text{subject to}\ & \sum_{t \in T} a_{it} z_t = 1, && \forall i = 1,\ldots,n \\
& z_t \in \{0,1\}, && \forall t \in T,
\end{aligned}
\tag{3}
$$

where T = {1,…,2^n − 1} indexes all nonempty subsets of points. The z_t variables are equal to 1 if cluster C_t is in the optimal partition, and to 0 otherwise. The constraints state that each point is assigned to exactly one cluster. This is a large set partitioning problem, for which the number of variables is exponential in the number n of points. Therefore, it cannot be explicitly written and solved in a straightforward way unless n is small. Column generation works with a reasonably small subset T′ ⊆ T of the columns in (3), i.e., with a restricted master problem which is solved iteratively, augmenting the number of columns in the restricted master problem until optimality is proved with the columns at hand. Entering columns are found by solving an auxiliary problem which consists in finding the list of points of a cluster whose associated variable in (3) has negative reduced cost.

In this paper, CG will be used to provide a lower bound for problem (1); hence, only the linear relaxation of model (3) will be solved. If integrality were required, branching could be performed with the Ryan and Foster [30] branching rule.

2.1 Stabilization

A standard CG method for solving the linear relaxation of (3) suffers from very slow convergence due to high degeneracy. From the dual viewpoint, this means that there is a large number of optimal solutions of the dual problem. In order to speed up convergence, one can use the boxstep stabilization method of [25], whose principle is to identify a small region in the space of dual variables hopefully containing the optimal dual solution. In this method, any departure from this region is forbidden.

Consider the dual of the relaxed form of formulation (3), expressed by

$$
\begin{aligned}
\max\ & \sum_{i=1}^{n} \lambda_i \\
\text{subject to}\ & \sum_{i=1}^{n} a_{it}\lambda_i \le c_t && \forall t \in T \\
& \lambda_i \ \text{free}, && i = 1,\ldots,n,
\end{aligned}
\tag{4}
$$

where the λ_i for i = 1,…,n are dual variables associated with the assignment constraints.

Lower and upper bounds can be included in (4) by the following set of constraints:

$$
l_i \le \lambda_i \le u_i, \qquad i = 1,\ldots,n,
\tag{5}
$$

where l_i ≥ 0 and u_i ≥ 0 correspond to the lower and upper bounds, respectively, on dual variable λ_i. du Merle et al. [11] propose a procedure to estimate the lower and upper bounds on the variables of the dual formulation of the minimum sum-of-squares clustering problem [2], which differs from (1) since k is fixed and the constraints Σ_{i=1}^n x_ij ≥ g, ∀j = 1,…,k, are omitted. This is done by computing the increase (resp. the decrease) of the objective function value if a point is duplicated (resp. removed) from a heuristic solution (cf. [11] for details).


For i = 1,…,n, let c_{j_r} be the cost of the cluster j_r containing point p_i in the heuristic solution and c′_{j_r} the cost of that cluster when p_i is omitted. Then, l_i is estimated as c_{j_r} − c′_{j_r}. Regarding the upper bound u_i, it is estimated as c_{j_a} − c_{j_r}, where c_{j_a} is the cost of the cluster j_a in which the addition of p_i yields the minimum augmentation in the cost of the solution.

The same procedure of [11] was used here to estimate lower and upper bounds on the dual variables of (4). However, in some cases the lower bound of a given dual variable λ_i cannot be estimated, since the removal of the associated point from its cluster makes the solution infeasible (i.e., the resulting cluster has cardinality smaller than g). For such cases, l_i is set to −∞.
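The sketch below illustrates this estimation (all names are ours; the Point alias comes from the earlier sketch). One reading of the rule above is assumed: u_i is taken as the cheapest cost increase of inserting p_i into another cluster of the heuristic partition, following [11], and l_i as the cost drop when p_i leaves its own cluster, or −∞ when that removal would violate the minimum size g:

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// SSE cost of the single cluster formed by the points indexed in `members`.
double clusterCost(const std::vector<Point>& p, const std::vector<int>& members) {
    const std::size_t s = p[0].size();
    Point y(s, 0.0);
    for (int i : members)
        for (std::size_t r = 0; r < s; ++r) y[r] += p[i][r];
    for (std::size_t r = 0; r < s; ++r) y[r] /= members.size();
    double c = 0.0;
    for (int i : members)
        for (std::size_t r = 0; r < s; ++r)
            c += (p[i][r] - y[r]) * (p[i][r] - y[r]);
    return c;
}

// Estimate (l_i, u_i) for every point from a heuristic partition `clusters`.
void estimateDualBounds(const std::vector<Point>& p,
                        const std::vector<std::vector<int>>& clusters, int g,
                        std::vector<double>& l, std::vector<double>& u) {
    const double INF = std::numeric_limits<double>::infinity();
    l.assign(p.size(), -INF);                        // stays -infinity if not estimable
    u.assign(p.size(), INF);
    for (std::size_t j = 0; j < clusters.size(); ++j) {
        const double cj = clusterCost(p, clusters[j]);
        for (int i : clusters[j]) {
            if (static_cast<int>(clusters[j].size()) > g) {  // removal keeps |C| >= g
                std::vector<int> rest;
                for (int q : clusters[j]) if (q != i) rest.push_back(q);
                l[i] = cj - clusterCost(p, rest);    // l_i = c_{j_r} - c'_{j_r}
            }
            for (std::size_t a = 0; a < clusters.size(); ++a) {
                if (a == j) continue;                // try every other cluster j_a
                std::vector<int> aug = clusters[a];
                aug.push_back(i);                    // tentative insertion of p_i
                u[i] = std::min(u[i],
                                clusterCost(p, aug) - clusterCost(p, clusters[a]));
            }
        }
    }
}
```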

The resulting primal problem on which CG operates is then expressed by:

$$
\begin{aligned}
\min_{z,w^+,w^-}\ & \sum_{i=1}^{n} u_i w_i^+ - \sum_{i=1}^{n} l_i w_i^- + \sum_{t \in T} c_t z_t \\
\text{subject to}\ & w_i^+ - w_i^- + \sum_{t \in T} a_{it} z_t = 1, && \forall i = 1,\ldots,n \\
& w_i^+, w_i^- \ge 0, && \forall i = 1,\ldots,n \\
& z_t \ge 0, && \forall t \in T,
\end{aligned}
\tag{6}
$$

where w_i^+ and w_i^-, for i = 1,…,n, are primal variables associated with the dual constraints λ_i ≤ u_i and λ_i ≥ l_i, respectively.

The use of formulation (6) by the CG algorithm implies that one must check whether there are constraints λ_i ≥ l_i and λ_i ≤ u_i, for i = 1,…,n, which are active at the optimal solution. If a constraint λ_i = l_i is active, then l_i must be reduced; likewise, if a constraint λ_i = u_i is active, u_i must be increased. In our experiments, we halve the lower bound l_i in the case λ_i = l_i as long as l_i > ε = 10⁻⁶; when this is no longer true, l_i is set to −∞. The upper bound u_i is doubled in the case λ_i = u_i. Otherwise, if no such active constraints exist, the linear relaxation of (3) has been solved to optimality.
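A minimal sketch of this bound-pushing rule (the role of testBoundsOnDualVariables in Algorithm 3 of Sect. 6; names are ours, and a real implementation would test activity with a tolerance rather than exact floating-point equality):

```cpp
#include <limits>
#include <vector>

// Push active dual bounds after solving (6): halve an active lower bound
// while it exceeds epsilon = 1e-6, then drop it to -infinity; double an
// active upper bound. Returns true if any bound was updated.
bool pushBounds(const std::vector<double>& lambdaOpt,  // optimal duals of (6)
                std::vector<double>& l, std::vector<double>& u) {
    const double eps = 1e-6;
    const double INF = std::numeric_limits<double>::infinity();
    bool updated = false;
    for (std::size_t i = 0; i < lambdaOpt.size(); ++i) {
        if (lambdaOpt[i] == l[i]) {                 // lambda_i = l_i is active
            l[i] = (l[i] > eps) ? l[i] / 2.0 : -INF;
            updated = true;
        }
        if (lambdaOpt[i] == u[i]) {                 // lambda_i = u_i is active
            u[i] *= 2.0;
            updated = true;
        }
    }
    return updated;  // false: the linear relaxation of (3) is solved to optimality
}
```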

3 Pricing problem

From the dual viewpoint, problem (6) is solved using a cutting plane method, starting with a relaxation and adding constraints as necessary. Given dual values λ, a violated cut is sought to be added to the relaxed dual problem. The slack or, when negative, the violation π_t of a constraint is given by π_t = c_t − Σ_{i=1}^n λ_i a_{it}. Since we are interested in finding violated constraints π_t < 0, the pricing problem is given by π* = min_t π_t. Although the enumeration of π_t for all t ∈ T is too expensive, the value of π* can be found by solving

$$
\begin{aligned}
\min\ & \sum_{i=1}^{n} \left(\|p_i - y_v\|^2 - \lambda_i\right) v_i \\
\text{subject to}\ & \sum_{i=1}^{n} v_i \ge g \\
& v_i \in \{0,1\}, \quad \forall i = 1,\ldots,n \\
& y_v \in \mathbb{R}^s,
\end{aligned}
\tag{7}
$$

with y_v denoting the centroid of the points p_i for which v_i = 1. If π* < 0, then the optimal solution v* to (7) is added as a cut to the relaxed dual problem (in the primal, this is equivalent to adding a column to the restricted master problem together with its associated primal variable). Otherwise, problem (6) is solved optimally.
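Equivalently, a candidate cluster prices out exactly when c_t − Σ_{i∈A_t} λ_i < 0. A small sketch of this test, reusing clusterCost from the previous sketch (the function name is ours):

```cpp
// Reduced cost pi_t of the column whose non-zero entries are `members`,
// with y_v at their centroid; the column enters the master iff pi_t < 0.
double reducedCost(const std::vector<Point>& p,
                   const std::vector<int>& members,   // indices with v_i = 1
                   const std::vector<double>& lambda) {
    double pi = clusterCost(p, members);              // c_t
    for (int i : members) pi -= lambda[i];            // minus the duals lambda_i
    return pi;
}
```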


At first glance, for a given location y_v, v_i is equal to 1 if ‖p_i − y_v‖² ≤ λ_i, and to 0 otherwise. Geometrically, in the plane, this is equivalent to the condition that v_i = 1 if y_v belongs to a disc D_i = {y | ‖p_i − y‖² ≤ λ_i} (i.e., a disc with radius √λ_i centered at p_i), and 0 otherwise. However, due to the constraint Σ_{i=1}^n v_i ≥ g, a variable v_i may be forced to be equal to 1. Consequently, decomposition approaches such as the one developed in [3] cannot be used here to solve the pricing problem.

Indeed, we need to find just one negative cost solution (v, y_v) for (7) in order to add the corresponding column to the restricted master problem. Proposition 1 shows that, if it exists, such a solution (v, y_v) can be sought among those with Σ_{i=1}^n v_i = g.

Proposition 1 If a negative cost solution (v′, y′) such that Σ_{i=1}^n v′_i ≥ g + 1 exists for (7), then a negative cost solution (v, y) with Σ_{i=1}^n v_i = g also exists.

Proof Without loss of generality, let us suppose that (v′, y′) is a negative cost solution for (7) with Σ_{i=1}^n v′_i = g + 1. Define S as the set of components of v′ for which v′_i = 1. Suppose now that there is no negative cost solution (v, y) for (7) with Σ_{i=1}^n v_i = g; in other words, it is impossible to find a y for which Σ_{i=1}^n (‖p_i − y‖² − λ_i) v_i is negative using only g components of v equal to 1. Consequently, there exists at least one element v_{i*} in S for which ‖p_{i*} − y′‖² − λ_{i*} > 0 (if all g + 1 contributions were nonpositive, dropping the one closest to zero would already leave a negative cost solution with g components). Hence, if v_{i*} is set to 0, a solution (v, y′) with only g components equal to 1 is constructed, having negative cost smaller than that of (v′, y′), which is a contradiction. □

Corollary 1 The optimal solution of the linear relaxation of problem (3) uses only variables whose columns have g non-zero elements.

Proof From Proposition 1, a negative reduced cost variable associated with a column with more than g non-zero elements exists only if at least one negative reduced cost variable associated with a column with exactly g non-zero elements also exists. Consequently, a CG algorithm for solving (3) converges if it adds at each iteration only variables associated with columns with exactly g non-zero elements. □

Our CG algorithm uses the previous propositions to restrict the search in the pricing problem. In particular, microaggregation problems have been benchmarked in the literature using instances with g = 3, 5 and 10, which is advantageous for our approach for three main reasons:

(i) CG usually has good performance when columns are small (i.e., with few entries equal to 1) [12];

(ii) Problem (3) is likely to be less degenerate when g is small. The SSE value of a given cluster is monotonically increasing in the number of points assigned to it [21], thereby favoring small clusters of size approximately g in the optimal solution. As more clusters exist, formulation (3) is likely to be less degenerate, since more nonzero z variables have to be used to compose a solution;

(iii) By using Proposition 1, the enumeration of pricing solutions for small values of g is not an expensive task for moderate n.

4 Specialized branch-and-bound

Our specialized branch-and-bound builds on Proposition 1 to find negative reduced cost columns for the restricted master problem. As shown by Corollary 1, in order to solve (6) by column generation, we can restrict ourselves to looking for columns with exactly g non-zero elements.

Let us consider α = {i | v_i = 1}, a negative cost solution for (7) with Σ_{i=1}^n v_i = g. A direct consequence of Proposition 1 is that a negative reduced cost solution for (7) must contain at least one subset β of α, with |β| = g − 1, such that Σ_{i∈β} (‖p_i − y_v‖² − λ_i) < 0. Likewise, the solution v composed by v_i = 1, ∀i ∈ β (which is infeasible for (7) since Σ_{i=1}^n v_i = g − 1) exists only if there is one subset γ of β, with |γ| = g − 2, such that Σ_{i∈γ} (‖p_i − y_v‖² − λ_i) < 0, and so on. The branch-and-bound developed here exploits this fact to cut off branches as soon as the cost of the partial solution is non-negative.

The branch-and-bound algorithm for (7) with g = 3 is shown as Algorithm 1. It performs an implicit enumeration of the solutions by means of the three loops of lines 1–21, 5–20 and 11–18. Function cost computes Σ_{i=1}^n (‖p_i − y_v‖² − λ_i) v_i for given v and y_v. A branch is cut off (command continue) in the first loop if λ_i ≤ 0 in line 3. Moreover, a branch can be eliminated at a deeper level in line 9 if the cost of the corresponding solution with two non-zero elements is greater than or equal to zero. Remark that the elimination criteria depend on the order in which the points are considered. For example, consider a solution with three non-zero elements v_1 = 1, v_2 = 1 and v_3 = 1 for which cost(v, y_v) < 0. Proposition 1 assures that there is at least one subset of two non-zero elements of V = {v_1, v_2, v_3} for which cost(v, y_v) < 0 (the other variable is set to 0), not that all of them have that property. Consequently, all three loops in Algorithm 1 must go from 1 to n. The algorithm stops as soon as a complete solution with three non-zero elements and negative cost is found in line 15. This solution is enough for the purpose of finding one negative reduced cost column to be added to the restricted master problem (6).

Since the number of nested for loops in Algorithm 1 is directly related to the value of g, the proposed algorithm is better suited for instances with small g, which is typically the case in the benchmark instances used in the literature.

Algorithm 1 Pseudo-code of the specialized branch-and-bound for (7).
 1: for i = 1,…,n do
 2:   if λ_i ≤ 0 then
 3:     continue;
 4:   end if
 5:   for j = 1,…,n do
 6:     if i ≠ j then
 7:       Make v_i = 1, v_j = 1, and v_k = 0, ∀k ≠ i, j;
 8:       if cost(v, y_v) ≥ 0 then
 9:         continue;
10:       end if
11:       for k = 1,…,n do
12:         if k ≠ i, j then
13:           Make v_k = 1 and v_l = 0, ∀l ≠ i, j, k;
14:           if cost(v, y_v) < 0 then
15:             return (v, cost(v, y_v));
16:           end if
17:         end if
18:       end for
19:     end if
20:   end for
21: end for
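For concreteness, a direct C++ transcription of Algorithm 1 might look as follows (a sketch, not the authors' code; it reuses reducedCost from Sect. 3, which for fixed v equals cost(v, y_v) with y_v at the centroid, and recomputes costs from scratch where an efficient implementation would update them incrementally):

```cpp
#include <optional>
#include <vector>

// Specialized branch-and-bound for (7) with g = 3: returns the first
// selection {i, j, k} with negative reduced cost, or nullopt if none exists.
std::optional<std::vector<int>> pricingBB3(const std::vector<Point>& p,
                                           const std::vector<double>& lambda) {
    const int n = static_cast<int>(p.size());
    for (int i = 0; i < n; ++i) {
        if (lambda[i] <= 0) continue;                 // line 3: prune on lambda_i <= 0
        for (int j = 0; j < n; ++j) {
            if (j == i) continue;
            if (reducedCost(p, {i, j}, lambda) >= 0)  // line 9: prune this pair
                continue;
            for (int k = 0; k < n; ++k) {
                if (k == i || k == j) continue;
                std::vector<int> v{i, j, k};
                if (reducedCost(p, v, lambda) < 0)    // line 15: stop at the first
                    return v;                         //   negative cost column
            }
        }
    }
    return std::nullopt;  // no negative column: the current duals are optimal
}
```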


Moreover, a Variable Neighborhood Search (VNS) heuristic [15,16] is used to accelerate the solution of the pricing problem, which is solved by the heuristic until optimality must be checked by Algorithm 1. The VNS heuristic used here is very similar to that proposed by du Merle et al. [11] for the pricing problem of their analytic center cutting plane method (ACCPM) [13] for the minimum sum-of-squares clustering problem. The only difference lies in the use of a different neighborhood structure. In [11], the neighborhood structure N_d(v) is defined by the Hamming distance ρ between solutions v and v′ (i.e., the number of components in which these vectors differ): v′ ∈ N_d(v) ⇔ ρ(v, v′) = d. Thus, the VNS of [11] explores increasingly distant neighborhoods of the incumbent solution (i.e., by augmenting d), jumping from there to a new solution if and only if an improvement is made through local search. Since we are interested in finding negative cost solutions for (7) such that Σ_{i=1}^n v_i = g, our neighborhood structure, used within VNS, is defined as:

$$
v' \in N_d(v) \iff \rho(v, v') = d \ \text{ and } \ \sum_{i=1}^{n} v'_i = g.
$$

This modification speeds up the heuristic search, since its search space is considerably reduced.
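A neighbor in this structure is obtained by swapping the same number of points into and out of the current selection, so that the cardinality g is preserved. A sketch for even distance d = 2m (names and random-number handling are ours):

```cpp
#include <algorithm>
#include <random>
#include <vector>

// One VNS "shake" at Hamming distance d = 2m preserving |selection| = g:
// m random members leave the cluster and m random non-members enter it.
std::vector<int> shake(std::vector<int> members,   // current selection, size g
                       int n, int m, std::mt19937& rng) {
    std::vector<char> in(n, 0);
    for (int i : members) in[i] = 1;
    std::vector<int> out;                          // the n - g non-members
    for (int i = 0; i < n; ++i) if (!in[i]) out.push_back(i);
    std::shuffle(members.begin(), members.end(), rng);
    std::shuffle(out.begin(), out.end(), rng);
    for (int t = 0; t < m; ++t) members[t] = out[t];  // swap m out, m in
    return members;                               // rho(v, v') = 2m and sum v'_i = g
}
```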

5 Master problem cuts

Once the relaxation of (3) is solved, cuts can be added to strengthen the lower bound obtained. In [19], Ji and Mitchell propose a set of master problem cuts for a branch-and-price algorithm for the clique partitioning problem with minimum clique size requirement. In that problem, one looks for the partition of the vertices of a complete graph K_n = (V, E), with weight c_e on each edge, that minimizes the total weight of the edges having both endpoints in the same cluster (equivalently, subclique). That problem is similar to the microaggregation problem (1), except for its objective function.

Theorem 1 (Ji and Mitchell [19]) Consider A_t = {i | a_{it} = 1} for t ∈ T in (3). Then, given Q ⊆ {1,…,n}, inequality (8) is a valid constraint for (3):

$$
\sum_{t : A_t \subseteq Q} z_t \ \le\ \lfloor |Q| / g \rfloor.
\tag{8}
$$

The proof is trivial, since ⌊|Q|/g⌋ corresponds to the maximum number of clusters into which we can partition |Q| vertices while respecting the minimum cardinality constraint on each cluster (e.g., for |Q| = 7 and g = 3, at most ⌊7/3⌋ = 2 clusters can be entirely contained in Q). As this inequality does not depend on the objective function used, the same separation procedure proposed by Ji and Mitchell in [19] for the clique partitioning problem with minimum clique size requirement can be used for the microaggregation problem (3).

As checking all possible subsets Q of {1,…,n} for separation is an expensive task, Ji and Mitchell propose in [19] a heuristic that separates only those subsets generated by combining the corresponding subsets A_t of two fractional columns in the current restricted master problem solution (see pp. 91–92 of [19] for details). These combined subsets are stored in a set Q*. The authors mention that, according to their experiments, this subset of inequalities of type (8) captures the most important cutting planes of that type.


The addition of cuts of type (8) changes the pricing problem as follows:

$$
\begin{aligned}
\min\ & \sum_{i=1}^{n} \left(\|p_i - y_v\|^2 - \lambda_i\right) v_i \ -\ \sum_{p\,:\,A_v \subseteq Q_p} \sigma_p \\
\text{subject to}\ & \sum_{i=1}^{n} v_i \ge g \\
& v_i \in \{0,1\}, \quad \forall i = 1,\ldots,n \\
& y_v \in \mathbb{R}^s,
\end{aligned}
\tag{9}
$$

where Q_p refers to the p-th subset of Q*, σ_p ≤ 0, for p = 1,…,|Q*|, are the dual values associated with the master problem cuts (8), and A_v = {i | v_i = 1}.

Algorithm 1 is changed to solve pricing problem (9). Indeed, function cost needs to be modified to take the dual values σ into consideration in the objective function. Such a modification is computationally expensive, since it involves verifying whether the index set of the incumbent solution v is a subset of each set Q_i, for i = 1,…,|Q*|, in Q*. If so, the associated dual value σ_i is subtracted. However, since σ_i ≤ 0, ∀i = 1,…,|Q*|, we know that a solution for which Σ_{i=1}^n (‖p_i − y_v‖² − λ_i) v_i ≥ 0 cannot be a negative reduced cost column for the restricted master problem of (6), regardless of the σ dual values. In view of that, subset verification can be performed only when strictly necessary. Algorithm 2 presents the specialized branch-and-bound for (9) with g = 3.

The loop of lines 2–26 is very similar to Algorithm 1. It tries to find a negative reduced cost column with exactly g non-zero elements. Line 16 verifies whether a solution (v, y_v) with negative cost regarding the term Σ_{i=1}^n (‖p_i − y_v‖² − λ_i) v_i is still negative after subtracting the σ dual values associated with v. If so, a negative reduced cost column is returned to the restricted master problem. Otherwise, the corresponding column is inserted in set Z. Remark that, due to the σ values, Proposition 1 cannot be extended to problem (9). In fact, a solution to problem (9) may have more than g non-zero elements. However, it is still true that a negative cost solution v for (9) with more than g non-zero elements must have a subset of g non-zero components for which the term Σ_{i=1}^n (‖p_i − y_v‖² − λ_i) v_i is negative (the proof follows directly from Proposition 1). The set Z is used to store the solutions (v, y_v) which are negative with respect to the term Σ_{i=1}^n (‖p_i − y_v‖² − λ_i) v_i, but eliminated due to the σ dual values associated with the master problem cuts in (6) whose index sets encompass A_v.

The loop of lines 27–43 analyzes, for each solution (v, y_v) in Z, whether the addition of a new non-zero component to the solution yields a negative cost solution to problem (9). If so, the solution is returned in line 35. If the resulting solution has negative cost only with respect to the term Σ_{i=1}^n (‖p_i − y_v‖² − λ_i) v_i, it is included in Z in line 37. The loop of lines 27–43 is finite since v can have at most n non-zero components. Moreover, it is shown in Domingo-Ferrer and Mateo-Sanz [7] that the sizes of the groups in the optimal partition lie between g and 2g − 1. The proof is based on the fact that the objective function of (1) is monotonically decreasing in k: a group of size greater than 2g − 1 in the clustering can always be split into two clusters of size greater than or equal to g, yielding a better partition. Thus, cost(v, y_v) is implemented to return +∞ in line 33 whenever the number of non-zero components in v surpasses 2g − 1.

VNS is no longer used after the first master cut is added, since checking subset pertinence would be too cumbersome for the heuristic. Consequently, just after the first cuts of type (8) have been added to (6), Algorithm 2 is used at each CG iteration to obtain a negative reduced cost column.


Algorithm 2 Pseudo-code of the specialized branch-and-bound for (9).
 1: Z ← ∅
 2: for i = 1,…,n do
 3:   if λ_i ≤ 0 then
 4:     continue;
 5:   end if
 6:   for j = 1,…,n do
 7:     if i ≠ j then
 8:       Make v_i = 1, v_j = 1, and v_k = 0, ∀k ≠ i, j;
 9:       if cost(v, y_v) ≥ 0 then
10:         continue;
11:       end if
12:       for k = 1,…,n do
13:         if k ≠ i, j then
14:           Make v_k = 1 and v_l = 0, ∀l ≠ i, j, k;
15:           if cost(v, y_v) < 0 then
16:             if cost2(v, y_v, Q*) < 0 then
17:               return (v, cost2(v, y_v, Q*));
18:             else
19:               Z ← Z ∪ (v, y_v);
20:             end if
21:           end if
22:         end if
23:       end for
24:     end if
25:   end for
26: end for
27: while Z ≠ ∅ do
28:   select a solution (v, y_v) from Z;
29:   Z ← Z \ (v, y_v);
30:   for i = 1,…,n do
31:     if v_i = 0 then
32:       v_i ← 1;
33:       if cost(v, y_v) < 0 then
34:         if cost2(v, y_v, Q*) < 0 then
35:           return (v, cost2(v, y_v, Q*));
36:         else
37:           Z ← Z ∪ (v, y_v);
38:         end if
39:       end if
40:       v_i ← 0;
41:     end if
42:   end for
43: end while
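A sketch of the cost2 function used in Algorithm 2 (names are ours; it reuses reducedCost from Sect. 3): the λ-only cost is corrected by subtracting σ_p for every stored subset Q_p that contains the column's index set A_v, exactly as in (9):

```cpp
#include <set>
#include <vector>

// Reduced cost of a column under pricing problem (9): the lambda-only term
// minus sigma_p for every master cut whose subset Q_p contains A_v.
double cost2(const std::vector<Point>& p, const std::vector<int>& members,
             const std::vector<double>& lambda,
             const std::vector<std::set<int>>& Q,   // stored subsets Q_p of Q*
             const std::vector<double>& sigma) {    // dual values sigma_p <= 0
    double c = reducedCost(p, members, lambda);     // sum (||p_i - y_v||^2 - lambda_i) v_i
    for (std::size_t q = 0; q < Q.size(); ++q) {
        bool contained = true;                      // is A_v a subset of Q_q?
        for (int i : members)
            if (Q[q].count(i) == 0) { contained = false; break; }
        if (contained) c -= sigma[q];               // -sigma_q >= 0: cost can only grow
    }
    return c;
}
```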

6 Algorithm to compute lower bounds

We state in Algorithm 3 the complete CG method to compute lower bounds for numerical microaggregation. In order to understand it, the following procedures need to be defined:

(i) solveRestrictedMasterProblem solves the incumbent restricted master problem, i.e., model (6) with its current subset of columns.
(ii) VNS uses the VNS heuristic to solve the pricing problem for the dual variables λ returned from (i).
(iii) Algorithm1 uses Algorithm 1 to solve the pricing problem for the dual variables λ returned from (i).
(iv) testBoundsOnDualVariables tests whether the bounds on the dual variables λ from (i) are active and pushes them if so.
(v) separateMasterProblemCuts applies the heuristic of [19] to separate master problem cuts, inserting their associated subsets into Q*.
(vi) Algorithm2 uses Algorithm 2 to solve the pricing problem for the dual variables λ, σ returned from (i) and for the current set Q*.


Algorithm 3 Pseudo-code of the column generation algorithm for (6).
 1: Set the restricted master problem to an empty problem;
 2: Compute an initial upper bound solution by means of a heuristic from the literature;
 3: Add the corresponding columns to the restricted master problem;
 4: From the solution computed in Step 2, obtain the coefficients l_i and u_i;
 5: stop ← false;
 6: Q* ← ∅;
 7: lb ← −∞;
 8: while stop = false do
 9:   (λ, σ) ← solveRestrictedMasterProblem;
10:   if Q* = ∅ then
11:     (v, cost) ← VNS;
12:     if cost < 0 then
13:       Add column v to the restricted master problem;
14:     else
15:       (v, cost) ← Algorithm1;
16:       if cost < 0 then
17:         Add column v to the restricted master problem;
18:       else
19:         lb ← Σ_{i=1}^n λ_i;
20:         testBoundsOnDualVariables;
21:         if dual bounds are not updated then
22:           Q* ← Q* ∪ separateMasterProblemCuts;
23:         end if
24:       end if
25:     end if
26:   else
27:     (v, cost) ← Algorithm2;
28:     if cost < 0 then
29:       Add column v to the restricted master problem;
30:     else
31:       lb ← Σ_{i=1}^n λ_i + Σ_{p: Q_p ∈ Q*} ⌊|Q_p|/g⌋ σ_p;
32:       testBoundsOnDualVariables;
33:       if dual bounds are not updated then
34:         Q* ← Q* ∪ separateMasterProblemCuts;
35:         if Q* is not modified then
36:           stop ← true;
37:         end if
38:       end if
39:     end if
40:   end if
41:   if time limit exceeded then
42:     stop ← true;
43:   end if
44: end while
45: return lb;



Table 1 List of data sets

Data set                  n      s
Vertebral column [29]     310    6
Body measurements [18]*   507    5
Tarragona [7]             834    13
Census [7]                1,080  13
Eia [7]                   4,092  11

* The attributes used are: weight, height, chest girth, waist girth and hip girth.

7 Computational experiments

Computational experiments were performed on a Pentium T4300 with a 2.1 GHz clock and 4 GB of RAM. The algorithms were implemented in C++ and compiled by gcc 4.4.3 with option -O4. Five real-world data sets were used in our numerical experiments. They are briefly listed in Table 1, together with references where more information about them can be found. In particular, the last three data sets have been widely used to compare heuristics for numerical microaggregation.

For all experiments reported here, initial upper bound solutions are obtained by the MDAV heuristic. They are used to estimate initial dual bounds, which are adjusted throughout execution whenever necessary. These estimations are always exact when the initial upper bound solution is the optimal one and no integrality gap exists (cf. [11]). Furthermore, CPLEX 12.1.0 is used to solve each iteration of the restricted master problem (6).

7.1 Lower bounds

Table 2 presents lower bounds for the instances in Table 1 for g = 3 and g = 5, without using the cuts of type (8). For each instance, the third column IL_lb refers to the lower bound of (6) expressed in terms of the information loss measure IL = SSE/SST, where SST is the sum of squared distances from each point to the centroid of the whole data set. Information loss (IL) is a common measure of heuristic accuracy in the microaggregation literature: each heuristic is qualified by its IL value, and the lowest IL indicates the most accurate heuristic. The fourth column #iter presents the number of iterations of the CG algorithm. The fifth column #bb refers to the number of times the specialized branch-and-bound (i.e., Algorithm 1, as no master problem cuts are added) is called to solve (7). The sixth column #du presents the number of iterations in which the dual bounds l_i and u_i, for i = 1,…,n, were updated during CG execution. Finally, the seventh column presents the computing times (in seconds) spent by the CG algorithm.
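As a small illustration of the measure, IL can be computed directly from a partition, reusing sse and clusterCost from the earlier sketches (the function name is ours; the tables appear to report IL scaled by 100):

```cpp
// IL = SSE/SST: SST is the squared scatter of the whole data set around its
// grand centroid, i.e., the cost of the single cluster containing all points.
double informationLoss(const std::vector<Point>& p,
                       const std::vector<int>& cluster, int k) {
    std::vector<int> all(p.size());
    for (std::size_t i = 0; i < p.size(); ++i) all[i] = static_cast<int>(i);
    const double sst = clusterCost(p, all);   // scatter around the grand centroid
    return sse(p, cluster, k) / sst;          // multiply by 100 for the tabled scale
}
```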

We remark from Table 2 that CG iterations and computing times grow with g. This growth occurs more rapidly for the computing times. In particular, the CPU time spent by the CG algorithm for instance Vertebral column with g = 10 is approximately 2 days. For that reason, we do not report results using g = 10 for the larger data sets. Furthermore, the number of times the specialized branch-and-bound is called (column #bb) is often different from the number of times that the dual bounds are updated (column #du). This happens because the VNS heuristic occasionally fails to find a negative reduced cost column for the restricted master problem (we used 100 VNS iterations in our tests; see [11] for more details about the VNS heuristic framework).

Table 3 presents the lower bounds for the instances in Table 1 for g = 3 and g = 5 with the addition of cuts of type (8) to problem (6).


Table 2 CG results without cuts of type (8)

Data set            g   IL_lb   #iter   #bb   #du    Time(s)
Vertebral column    3   11.282    594    16    16       7.62
                    5   15.509  1,112    17     4      28.08
Body measurements   3    2.244    845     5     4      18.27
                    5    3.588  1,909    23     4     109.96
Tarragona           3   13.857  1,889     7     2      98.94
                    5   20.120  3,694    53     4     726.07
Census              3    4.646  2,056    16     4     187.25
                    5    7.209  4,140    54     5   1,111.86
Eia                 3    0.323  3,890   105     9   8,001.24
                    5    0.689  7,213   178    11  29,100.40

Table 3 CG results with cuts of type (8)

Data set            g   IL_lb^wc   #iter   #du   #cuts    Time(s)   gap_lb(%)
Vertebral column    3     11.291     651    18      60       9.90        0.07
                    5     15.519   1,198     7     132      59.47        0.06
Body measurements   3      2.255     992    12     132      27.61        0.49
                    5      3.607   2,181    10     607     396.79        0.53
Tarragona           3     14.436   2,093     7     198     152.33        4.18
                    5     20.138   3,858     7     361   1,179.52        0.09
Census              3      4.662   2,310     7     603     267.94        0.34
                    5      7.222   4,512    13     853   3,188.17        0.18
Eia                 3      0.344   5,078     9   9,037  15,512.10        6.50
                    5      0.700   8,722    30   3,117          *        1.60

Column IL_lb^wc refers to the lower bounds of (6) after adding all cuts of type (8) separated by the heuristic of [19], except for instance Eia with g = 5, which was stopped after 1 day of computation (indicated in the table by the '*' symbol). Column #cuts reports the number of cuts added. Finally, column gap_lb refers to the relative difference (in %) between the lower bound IL_lb shown in Table 2 and IL_lb^wc, calculated as (IL_lb^wc − IL_lb)/IL_lb.

The results in Table 3 reveal that:

– The use of the master problem cuts in problem (6) may considerably improve the lower bound obtained by the CG algorithm. In particular, for the Eia data set with g = 3, the lower bound was improved by 6.5 %.

– The addition of cuts to problem (6) may change the dual optimal solution considerably. This is observed when the values presented in column #du are compared with those of Table 2. The instance which requires the largest number of dual updates is Eia with g = 5, which makes its solution more computationally expensive.

– The number of cuts appears to increase with g. Indeed, columns with a larger number of non-zero elements can be combined in more ways.


– The slow-down factor caused by the addition of master problem cuts was on average 1.26 for instances with g = 3, and 2.55 for instances with g = 5, without considering instance Eia with g = 5.

Though not explicit in the results shown, the relaxed solutions obtained are always highly fractional, and hence a large branch-and-bound tree is likely to be required in order to eliminate the integrality gap. Our preliminary numerical experiments with a branch-and-price algorithm discouraged the development of an exact approach that could solve problem (3) in reasonable CPU time.

7.2 Specialized branch-and-bound

In order to assess the performance of the specialized branch-and-bound of Sect. 4 in solving problem (7), we compare it with the convex mixed-integer nonlinear programming (MINLP) solver Bonmin [4].

Bonmin requires a convex MINLP input problem. To this purpose, we perform a reformulation of problem (7), showing that its continuous relaxation is convex. First, we remark that problem (7) contains (nonconvex) bilinear terms in the objective function, so that its continuous relaxation is not necessarily easy to solve.

The following exact reformulation of (7) (in the sense of [23]) is proposed: variables w_i are introduced for all i = 1,…,n, the objective function is replaced by Σ_{i=1}^n w_i, and the following constraints are added:

$$
\begin{aligned}
w_i &\ge \|p_i - y\|^2 - \lambda_i - M_i(1 - v_i) && \forall i = 1,\ldots,n \\
w_i &\ge -\lambda_i v_i,
\end{aligned}
$$

where M_i ≥ max_j ‖p_i − p_j‖² − λ_i, ∀i = 1,…,n. Cardinality and integrality constraints on y, v are maintained in the reformulation. It is then expressed as:

$$
\begin{aligned}
\min\ & \sum_{i=1}^{n} w_i \\
\text{s.t.}\ & w_i \ge \|p_i - y\|^2 - \lambda_i - M_i(1 - v_i) && \forall i = 1,\ldots,n \\
& w_i \ge -\lambda_i v_i && \forall i = 1,\ldots,n \\
& \sum_{i=1}^{n} v_i \ge g \\
& v_i \in \{0,1\} && \forall i = 1,\ldots,n \\
& y \in \mathbb{R}^s \\
& w_i \in \mathbb{R} && \forall i = 1,\ldots,n.
\end{aligned}
\tag{10}
$$

Problem (10) is a convex MINLP: it involves continuous and integer variables as well as nonlinear terms, and its continuous relaxation is a convex nonlinear problem since (i) norm functions are convex [5], (ii) adding a linear function to a convex one results in a convex function, and (iii) bounding convex functions from above defines a convex set.
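A valid set of constants M_i can be computed directly from the data under the bound stated above (a sketch; since y lies in the convex hull of the selected points, ‖p_i − y‖² ≤ max_j ‖p_i − p_j‖², so these values are large enough):

```cpp
#include <algorithm>
#include <vector>

// Big-M constants for reformulation (10): M_i = max_j ||p_i - p_j||^2 - lambda_i.
std::vector<double> bigM(const std::vector<Point>& p,
                         const std::vector<double>& lambda) {
    const std::size_t n = p.size(), s = p[0].size();
    std::vector<double> M(n);
    for (std::size_t i = 0; i < n; ++i) {
        double far = 0.0;                      // max squared distance from p_i
        for (std::size_t j = 0; j < n; ++j) {
            double d = 0.0;
            for (std::size_t r = 0; r < s; ++r)
                d += (p[i][r] - p[j][r]) * (p[i][r] - p[j][r]);
            far = std::max(far, d);
        }
        M[i] = far - lambda[i];
    }
    return M;
}
```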

We tested Bonmin and our specialized branch-and-bound on solving the first pricing problem for which the VNS heuristic fails to find a negative cost solution. The results are summarized in Table 4. Column opt*_alg1 refers to the optimal solution obtained by Algorithm 1, whereas columns Time_alg1 and #bb_alg1 refer to the CPU time spent by the algorithm and to the number of branch-and-bound nodes explored, respectively.


Table 4 Algorithm 1 and Bonmin results on solving a pricing problem

Data set            g   opt*_alg1   Time_alg1(s)   #bb_alg1        sol_bonmin   #bb_bonmin
Vertebral column    3      −30.04           0.01        610,766          0.00       12,059
                    5      −35.70           0.28   ≈20 million         −18.54       16,800
Body measurements   3       −2.57           0.02  ≈1.5 million           0.00        9,662
                    5      −12.05           0.66   ≈47 million           0.00        9,007
Tarragona           3       −0.07           0.07  ≈6.5 million          49.10          697
                    5     −112.61           5.11  ≈379 million        −112.61          811
Census              3      −11.57           0.09  ≈8.5 million           6.88           41
                    5       −4.35           5.29  ≈377 million          24.53        1,152
Eia                 3       −4.00           0.87   ≈65 million          11.77          113
                    5      −29.46          45.99    ≈2 billion          47.74           95

Bonmin was unable to prove optimality for most of the instances within a time limit of 1 h of computation. Consequently, column sol_bonmin reports the best solution obtained by the solver within this time limit, and column #bb_bonmin gives the number of branch-and-bound nodes solved.

We note from Table 4 that our specialized branch-and-bound requires the solution of a large number of nodes, which increases rapidly with g. However, this is not a limiting factor for its performance, since the nodes are very easily solved (each node requires just a few arithmetic operations). In particular, although Bonmin was able to find the optimal solution for the Tarragona instance with g = 5, it could not guarantee its optimality within the time limit (the lower bound was −627.36 when the algorithm was halted).

7.3 Upper bounds

In order to obtain upper bounds for the multivariate microaggregation problem, a binary integer program (BIP) was solved with the columns generated by the CG algorithm. Since most of these columns tend to have g non-zero elements, we also included in the BIP, for each column generated, another column in which the point that yields the least increase in the cost of that column is added to it. Table 5 presents upper bound results obtained with the described heuristic. Column IL_ub refers to the best known upper bound solution, taken from [27]. Column gap1 reports the relative difference (in %) between the IL_ub values and the IL_lb^wc lower bound values presented in Table 3, calculated as (IL_ub − IL_lb^wc)/IL_lb^wc. These two columns are left blank for data sets Vertebral column and Body measurements since, to the best of our knowledge, this is the first time they are used in the microaggregation literature. Column IL_ub^CG shows the upper bound solution values obtained by our heuristic, whereas gap2 presents the relative difference (in %) to the IL_lb^wc lower bounds, calculated as (IL_ub^CG − IL_lb^wc)/IL_lb^wc. Finally, the last column presents the computing times (in seconds) spent by the heuristic on solving its associated BIP, i.e., it excludes the CPU time spent by the CG algorithm. We used CPLEX 12 to solve the BIPs with a CPU time limit of 1 h, which is the CPU time reported if CPLEX is halted before solving the program.

Our intention in devising a heuristic for (1) was only to check whether the lower bounds obtained by the CG algorithm were tight; if so, it should be possible to obtain upper bounds close to those lower bounds. From Table 5, we remark that the heuristic cannot compete with the state-of-the-art heuristics found in the literature in terms of computing times.


Table 5 Upper bound results

Data set            g   IL_ub   gap1(%)   IL_ub^CG   gap2(%)    Time(s)
Vertebral column    3       –         –      11.31      0.18       4.05
                    5       –         –      15.56      0.32      30.42
Body measurements   3       –         –       2.26      0.13       4.34
                    5       –         –       3.64      1.02   3,600.00
Tarragona           3   16.36     13.37      14.46      0.21       7.93
                    5   21.72      7.90      20.16      0.15   3,600.00
Census              3    5.53     18.67       4.67      0.21   3,600.00
                    5    8.58     18.83       7.36      1.94   3,600.00
Eia                 3    0.40     16.57       0.35      1.74   3,600.00
                    5    0.87     24.29       0.81     15.71   3,600.00

Table 6 Upper bound results from [20]

Data set    g   IL
Tarragona   3    5.49
            5   10.87
Census      3    1.78
            5    2.69
Eia         3    0.21
            5    0.43

Indeed, those methods scale to instances with approximately 100,000 points. However, we observe from the same table that the results obtained by the state-of-the-art heuristics are far from the lower bounds, by more than 20 % in one case. In contrast, the results obtained by our heuristic are usually very close to the optimal solution value. We remark that for all instances except Eia with g = 5 (which was stopped before adding all cuts of type (8) separated by the heuristic of [19]), the maximum gap found was 1.94 %, in instance Census with g = 5. For Eia with g = 5, the gap found was 15.71 %, which still corresponds to a reduction of 6.89 % in the best known upper bound previously found in the literature. Note that the results highlighted in the table are new best known solutions for the corresponding benchmark instances.

8 Conclusions

Microaggregation is a popular technique for protecting the privacy of individual records in statistical databases. It addresses the problem of finding clusters of size greater than or equal to g which are homogeneous and well separated. Each point is then replaced by the centroid of its cluster before data publication, thereby precluding the identification of a record within a cluster. As formulated by Domingo-Ferrer and Mateo-Sanz [7], i.e., through the use of an objective function that minimizes the sum of squared distances from each point to the centroid of its cluster, the problem is NP-hard in the case of multivariate data. Thus, numerous heuristics have been proposed for its solution. However, they lack a fair comparison of how far their solutions are from the optimal values, which can be achieved only if exact solutions and lower bounds are available. In this paper, we proposed a CG algorithm for the multivariate microaggregation problem with the objective of obtaining bounds for it.

The importance of providing lower bounds for the microaggregation problem can be illustrated by the upper bound results presented in [20], which are summarized in Table 6. We note that all upper bound results presented in Table 6 are smaller than the lower bounds provided by the CG algorithm for these instances in Table 3, and therefore cannot be correct. Such errors would very probably be avoided if lower bounds had previously been available in the literature.

Moreover, our results show that:

– The current state-of-the-art heuristics are usually far from optimality. This was shown by large integrality gaps, close to 20 % in some instances, whereas our column generation heuristic obtains upper bounds with integrality gaps smaller than 2 %. These results allow us to conclude that the scalability of the heuristics found in the literature has been achieved at the expense of solution quality.

– The addition of master problem cuts of type (8) improved by up to 6.5 % the lower bounds obtained by the CG algorithm.

– The use of the mathematical properties of the pricing problem by the specialized branch-and-bound proposed here made it possible to obtain lower bounds for microaggregation problems which are more than toy examples. In fact, lower bounds were provided for instances of the three most popular data sets used in the microaggregation literature.

Finally, it is worth mentioning that the mathematical results derived from Proposition 1 can be extended to any CG formulation of a clustering problem in which: (i) the number of clusters is not fixed, (ii) the minimum number of elements in a cluster is bounded, and (iii) the cost of a cluster is computed by a separable contribution of its elements. Clustering criteria for which these three conditions hold include clique partitioning [14] and cut partitioning [37].

Acknowledgments Research of the first author has been supported by the National Council for Scientific and Technological Development—CNPq/Brazil Grant Numbers 474231/2010-0 and 305070/2011-8. The authors also thank Prof. Costas Panagiotakis for providing the Tarragona, Census and Eia datasets.

References

1. Aggarwal, G., Feder, T., Kenthapadi, K., Motwani, R., Panigrahy, R., Thomas, D., Zhu, A.: Approximation algorithms for k-anonymity. J. Privacy Tech. (2005)
2. Aloise, D., Hansen, P.: Evaluating a branch-and-bound RLT-based algorithm for minimum sum-of-squares clustering. J. Glob. Optim. 49, 449–465 (2011)
3. Aloise, D., Hansen, P., Liberti, L.: An improved column generation algorithm for minimum sum-of-squares clustering. Math. Program. 131, 195–220 (2012)
4. Bonami, P., Lee, J.: BONMIN user's manual. Tech. rep., IBM Corporation, New York (2007)
5. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
6. Chang, C.C., Li, Y.C., Huang, W.H.: TFRP: an efficient microaggregation algorithm for statistical disclosure control. J. Syst. Softw. 80, 1866–1878 (2007)
7. Domingo-Ferrer, J., Mateo-Sanz, J.M.: Practical data-oriented microaggregation for statistical disclosure control. IEEE Trans. Knowl. Data Eng. 14, 189–201 (2002)
8. Domingo-Ferrer, J., Torra, V.: Ordinal, continuous and heterogeneous k-anonymity through microaggregation. Data Min. Knowl. Discov. 11, 195–212 (2005)
9. Domingo-Ferrer, J., Martínez-Ballesté, A., Mateo-Sanz, J., Sebé, F.: Efficient multivariate data-oriented microaggregation. VLDB J. 15, 355–369 (2006)
10. Domingo-Ferrer, J., Sebé, F., Solanas, A.: A polynomial-time approximation to optimal multivariate microaggregation. Comput. Math. Appl. 55, 714–732 (2008)
11. du Merle, O., Hansen, P., Jaumard, B., Mladenović, N.: An interior point algorithm for minimum sum-of-squares clustering. SIAM J. Sci. Comput. 21, 1485–1505 (2000)
12. Elhallaoui, I., Villeneuve, D., Soumis, F., Desaulniers, G.: Dynamic aggregation of set-partitioning constraints in column generation. Oper. Res. 53, 632–645 (2005)
13. Goffin, J.L., Haurie, A., Vial, J.-P.: Decomposition and nondifferentiable optimization with the projective algorithm. Manag. Sci. 38, 284–302 (1992)
14. Grötschel, M., Wakabayashi, Y.: Facets of the clique partitioning polytope. Math. Program. 47, 367–387 (1990)
15. Hansen, P., Mladenović, N.: Variable neighborhood search: principles and applications. Eur. J. Oper. Res. 130, 449–467 (2001)
16. Hansen, P., Mladenović, N., Moreno Pérez, J.A.: Variable neighbourhood search: methods and applications. 4OR 6, 319–360 (2008)
17. Hansen, S., Mukherjee, S.: A polynomial algorithm for optimal univariate microaggregation. IEEE Trans. Knowl. Data Eng. 15, 1043–1044 (2003)
18. Heinz, G., Peterson, L., Johnson, R., Kerk, C.: Exploring relationships in body dimensions. J. Stat. Educ. 11 (2003). www.amstat.org/publications/jse/v11n2/datasets.heinz.html
19. Ji, X., Mitchell, J.E.: Branch-and-price-and-cut on the clique partitioning problem with minimum clique size requirement. Discret. Optim. 4, 87–102 (2007)
20. Kabir, E., Wang, H., Zhang, Y.: A pairwise-systematic microaggregation for statistical disclosure control. In: 2010 IEEE 10th International Conference on Data Mining (ICDM), pp. 266–273 (2010)
21. Koontz, W., Narendra, P., Fukunaga, K.: A branch and bound clustering algorithm. IEEE Trans. Comput. C-24, 908–915 (1975)
22. Laszlo, M., Mukherjee, S.: Minimum spanning tree partitioning algorithm for microaggregation. IEEE Trans. Knowl. Data Eng. 17, 902–911 (2005)
23. Liberti, L.: Reformulations in mathematical programming: definitions and systematics. RAIRO-RO 43(1), 55–86 (2009)
24. Lin, J.L., Hsieh, T.H., Chang, J.C.: Density-based microaggregation for statistical disclosure control. Expert Syst. Appl. 37, 3256–3263 (2010)
25. Marsten, R., Hogan, W., Blankenship, J.: The boxstep method for large-scale optimization. Oper. Res. 23, 389–405 (1975)
26. Oganian, A., Domingo-Ferrer, J.: On the complexity of optimal microaggregation for statistical disclosure control. Stat. J. United Nat. Econ. Comm. Eur. 18, 345–354 (2001)
27. Panagiotakis, C., Tziritas, G.: Successive group selection for microaggregation. IEEE Trans. Knowl. Data Eng. 25, 1191–1195 (2012)
28. Rebollo-Monedero, D., Forné, J., Soriano, M.: An algorithm for k-anonymous microaggregation and clustering inspired by the design of distortion-optimized quantizers. Data Knowl. Eng. 70, 892–921 (2011)
29. Rocha Neto, A., Barreto, G.: On the application of ensembles of classifiers to the diagnosis of pathologies of the vertebral column: a comparative analysis. IEEE Lat. Am. Trans. 7, 487–496 (2009)
30. Ryan, D., Foster, B.: An integer programming approach to scheduling. In: Wren, A. (ed.) Computer Scheduling of Public Transport: Urban Passenger Vehicle and Crew Scheduling, pp. 269–280. North-Holland, Amsterdam (1981)
31. Solanas, A., Gavalda, A., Rallo, R.: Micro-SOM: a linear-time multivariate microaggregation algorithm based on self-organizing maps. LNCS 5768, 525–535 (2009)
32. Solanas, A., Martinez-Balleste, A., Domingo-Ferrer, J.: V-MDAV: a multivariate microaggregation with variable group size. In: 17th COMPSTAT Symposium of the IASC (2006)
33. Solanas, A., Martínez-Ballesté, A., Domingo-Ferrer, J., Mateo-Sanz, J.: A 2^d-tree-based blocking method for microaggregating very large data sets. In: Proceedings of the First International Conference on Availability, Reliability and Security (2006)
34. Sun, X., Wang, H., Li, J., Zhang, Y.: An approximate microaggregation approach for microdata protection. Expert Syst. Appl. 39, 2211–2219 (2012)
35. Sweeney, L.: k-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10, 557–570 (2002)
36. Willenborg, L., DeWaal, T.: Elements of Statistical Disclosure Control. Springer, New York (2001)
37. Wu, Z., Leahy, R.: An optimal graph theoretic approach to data clustering: theory and its application to image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 15, 1101–1113 (1993)
