


Heuristics for minimizing the maximum within-clusters distance

José Augusto FioruciI; Franklina M.B. ToledoI,*; Mariá Cristina V. NascimentoII (*Corresponding author)

IInstituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, SP, Brasil. E-mails: jaf@icmc.usp.br; fran@icmc.usp.br

IIInstituto de Ciência e Tecnologia, Universidade Federal de São Paulo, 12231-280 São José dos Campos, SP, Brasil. E-mail: mcv.nascimento@unifesp.br

ABSTRACT

The clustering problem consists of finding patterns in a data set in order to divide it into clusters with high within-cluster similarity. This paper studies a problem, here called the MMD problem, which aims at finding a clustering with a predefined number of clusters that minimizes the largest within-cluster distance (diameter) among all clusters. This paper has two main objectives: to propose heuristics for the MMD problem and to evaluate the suitability of the best proposed heuristic's results with respect to the real classification of some data sets. Regarding the first objective, the results obtained in the experiments indicate a good performance of the best proposed heuristic, which outperformed the Complete Linkage algorithm (the most used method from the literature for this problem). Nevertheless, regarding the suitability of the results with respect to the real classification of the data sets, the proposed heuristic achieved better quality results than the C-Means algorithm, but worse than Complete Linkage.

Keywords: clustering, heuristics, GRASP, minimization of the maximum diameter.

1 INTRODUCTION

The data clustering problem aims at identifying similar characteristics among the objects of a given data set in order to divide them into clusters. An example of similarity is the proximity of objects in the data set, that is, the objects within a specific group should be closer to each other than to objects located in other clusters. Some data clustering applications are: data mining (Boginski et al., 2006; Romanowski et al., 2006), multiple protein sequence alignment (Kawaji et al., 2004; Krause et al., 2005), gene expression (Higham et al., 2007; Huttenhower et al., 2007) and image segmentation (Wu & Leahy, 1993), among many others.

Data clustering is not an easy task because, in addition to the different sizes of most data sets, their clusters are often not clearly identifiable. Thus, several models to approach the data clustering problem have been proposed in the literature (Hansen & Mladenovic, 2001; Hansen & Jumard, 1997; Jain & Dubes, 1988; Rao, 1971). Each of these models has its own bias, showing better performance for specific types of data sets depending on the similarity measure adopted.

Some mathematical models for the data clustering problem are presented in Rao (1971). In particular, the author proposed a mathematical model whose objective is to find a partition with a given number of clusters (defined a priori) that minimizes the maximum diameter (MMD) among all the clusters. In this formulation, only the distances between pairs of objects are considered, not necessarily their exact positions in the feature space. Rao (1971) also proposed a simple and efficient exact algorithm to solve this problem when only two clusters are considered. Given the combinatorial nature of the MMD problem, whose associated decision problem is NP-hard, this paper proposes four heuristic methods for the clustering problem: two greedy heuristics (CH and GHLS) and two Greedy Randomized Adaptive Search Procedures (Feo & Resende, 1995; Resende & Ribeiro, 2010) (GRASP-I and GRASP-II).

To assess the suitability of the solution methods proposed for the MMD problem, we used instances from the literature (Nascimento, 2010) in the computational tests. These instances were produced for the analysis of the network community detection problem. The results of applying the methods to these data sets were evaluated according to two criteria: 1) the objective function value (the value of the largest diameter); and 2) the adequacy of the solution found, using an external evaluation criterion for data clustering, the Normalized Mutual Information index, here referred to as NMI (Danon et al., 2005). In the first experiment, we evaluated the solutions found by the proposed heuristics and compared them with each other. In the second experiment, we used the software CPLEX 12.2 (IBM ILOG, 2010) to look for the optimal solutions of the data sets; in this experiment, we compared the solutions found by the best of the four proposed heuristics, the upper bounds found by CPLEX and the results of a benchmark heuristic for the MMD problem, Complete Linkage (Hansen & Delattre, 1978). In the third experiment, to assess the adequacy of the solutions according to the NMI, the partitions found by Complete Linkage, by a well-known algorithm from the literature, C-Means (Bezdek, 1981), and by the best heuristic proposed in this paper, GRASP-II, were compared with the real data classification.

The results showed that GRASP-II achieved better solutions than the other proposed heuristics for 85% of the data sets. Moreover, this metaheuristic found the optimal solution for all twelve data sets for which CPLEX provided exact solutions. However, Complete Linkage was the heuristic that showed the best performance according to the NMI, while GRASP-II obtained better quality results than C-Means and CPLEX. Note that these conclusions hold with respect to the application and the specific characteristics of the evaluated graphs.

The remainder of the paper is organized as follows. Section 2 describes the mathematical model proposed by Rao (1971) for the studied problem. For this paper to be self-contained, Section 3 presents the exact method proposed by Rao (1971) for the 2-Clusters problem. The proposed heuristics are detailed in Section 4. An algorithm for the Complete Linkage method is provided in Section 5. The computational results are reported in Section 6. Finally, some concluding remarks are presented in Section 7.

2 MATHEMATICAL MODEL

Rao (1971) proposed a mathematical formulation for cluster analysis whose objective is to find a partition of a data set that minimizes the maximum within-cluster distance. This problem is also known as the minimization of the largest diameter among the clusters. The mathematical formulation is described by (1)-(4) and the notation is given as follows:

Parameters

N - number of objects of the data set;

M - number of clusters of the final partition;

dij - distance between objects i and j;

Variables

xik - binary variable that assigns object i to cluster k (xik = 1, if object i is in cluster k; 0 otherwise);

Z - value of the largest diameter among the M clusters (continuous variable);
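Based on these definitions, the formulation (1)-(4) can be written as follows (a reconstruction consistent with the description of the objective function and constraints given below):

```latex
\min \quad Z \hfill (1)

\text{s.t.} \quad d_{ij}\,(x_{ik} + x_{jk} - 1) \le Z,
  \qquad i = 1,\dots,N-1,\; j = i+1,\dots,N,\; k = 1,\dots,M \hfill (2)

\sum_{k=1}^{M} x_{ik} = 1, \qquad i = 1,\dots,N \hfill (3)

x_{ik} \in \{0,1\}, \quad Z \ge 0, \qquad i = 1,\dots,N,\; k = 1,\dots,M \hfill (4)
```

In (2), the left-hand side is positive only when objects i and j are both assigned to cluster k, in which case the constraint forces Z to be at least dij.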

The objective function (1) minimizes the value of Z, which represents the maximum within-cluster distance. Constraints (2) ensure that the value of the largest cluster diameter is less than or equal to the value of the variable Z. Constraints (3) impose that each object belongs to exactly one cluster. Constraints (4) enforce the non-negativity of the variable Z and the binary nature of the other variables.

Brusco & Stahl (2005) point out that this problem has the tendency to produce clusters with just one object (isolated clusters). The authors proposed a specialized branch-and-bound (B&B) method to solve the problem. However, we evaluated this algorithm and, for data sets with hundreds of objects, the elapsed time to find the optimal solution was very high. We also provided the B&B with a good initial solution in order to speed up its convergence; however, the results remained poor. For example, for an instance with 336 objects and 8 clusters, we could not obtain the optimal solution after two days of computation. For this reason, to compare the heuristic results with the optimal solution, we used the software CPLEX 12.2 (IBM ILOG, 2010).

In this paper, we use the usual definition of partition to represent a clustering. In this case, a partition is the set of groups G1, G2, ..., GM, where the elements of Gi are objects that belong to the same cluster and, if i ≠ j, the elements of Gi belong to a cluster different from that of the elements of Gj. Moreover, the union of all Gi's is the whole set of objects from the data set, and the intersection of these sets is empty.

3 2-CLUSTERS ALGORITHM FOR MMD PROBLEM

Rao (1971) proposed a polynomial-time exact algorithm to solve the MMD problem for instances where the number of clusters in the final partition is equal to two. Here, this algorithm is called the 2-Clusters algorithm, and the four heuristics proposed in this paper are based on it. The 2-Clusters algorithm relies on a simple idea: at each step, it tries to assign the two most distant objects to different clusters.

In our description of the 2-Clusters algorithm, we maintain the notation and nomenclature used in Section 2. In addition, D is the matrix of pairwise distances between objects, labels A and B are used to define the two definitive clusters, and R(i) is the function that gives the label of object i. For example, if object i is labeled A, then R(i) = A and object i is definitively assigned to cluster A. It is also possible to assign object i temporarily to a cluster. For this situation, the author used a temporary label k, that is, R(i) = k, where k is an integer between -N and N. The 2-Clusters algorithm is described next.
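As an illustration of the same greedy idea (a sketch, not Rao's original bookkeeping with temporary labels), the pairs can be processed in decreasing order of distance with a union-find structure that records, for each object, its parity relative to the root of its component; forcing each newly examined pair into opposite parities reproduces the algorithm's assignments:

```python
import itertools

def two_clusters(D):
    """Sketch of the 2-Clusters idea: scan pairs in decreasing order of
    distance and force each pair whose relation is still undecided into
    different clusters, tracked by a union-find with parity."""
    n = len(D)
    parent = list(range(n))
    offset = [0] * n  # parity of each object relative to its root

    def find(i):
        if parent[i] == i:
            return i
        p = parent[i]
        root = find(p)
        offset[i] ^= offset[p]  # re-express parity relative to the root
        parent[i] = root
        return root

    # pairs sorted by decreasing distance
    pairs = sorted(itertools.combinations(range(n), 2),
                   key=lambda p: -D[p[0]][p[1]])
    merges = 0
    for i, j in pairs:
        ri, rj = find(i), find(j)
        if ri == rj:
            continue            # relation already determined: skip the pair
        parent[ri] = rj         # union forcing opposite labels for i and j
        offset[ri] = offset[i] ^ offset[j] ^ 1
        merges += 1
        if merges == n - 1:     # every object is now labeled
            break
    for i in range(n):
        find(i)
    return offset               # 0/1 cluster label per object
```

With ties in the distances, the specific labeling may differ from Rao's trace, but the pair realizing each examined distance is always separated when possible, which is the property the optimality proof below relies on.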

In Figure 1, we show the step-by-step execution of the 2-Clusters algorithm on a hypothetical data set. In each step, the arrow indicates the two objects with the largest pairwise distance. Circles with gray and black nuclei represent the definitive labels A and B, respectively. Temporary labels (k) are represented by squares and triangles.


In the first step (Fig. 1a), all objects are unlabeled. In Figure 1b, the pair of objects with the largest pairwise distance receives the definitive labels (black and gray), and this distance is set to zero in the distance matrix. In Figure 1c, the new pair of objects with the largest distance is selected and labeled to belong to different clusters; afterwards, the distance between this pair is set to zero in the distance matrix. In this case, the pair receives different temporary labels, the black and the gray squares. This process is repeated in Figure 1d, now with triangle labels instead of squares. In Figure 1e, the largest distance corresponds to the distance between an unlabeled object and an object labeled gray. Consequently, the unlabeled object receives the definitive label different from gray, that is, the black label. The largest distance in Figure 1f is determined by an object with a black triangle label (see Fig. 1d) and an object with the definitive black label. Therefore, the object with the temporary label receives the definitive gray label, and all other objects with black triangle labels also receive the definitive gray label. In addition, all objects with gray triangle labels receive the definitive black label. In Figure 1g, the objects that determine the largest distance already have definitive labels, so we simply set their distance to zero in matrix D. The same process as in Figure 1g, but with different objects, occurs in Figure 1h: the object with the black square label receives the definitive black label and the object with the gray square label receives the definitive gray label. In the end, all objects have definitive labels and the final partition with 2 clusters is illustrated in Figure 1i.

According to Rao (1971), the optimality proof for the 2-Clusters algorithm is obvious. Nevertheless, in this paper we present a formal proof of the optimality of this algorithm. To that end, consider first the definition of a partition: a partition {G1, G2, ..., GM} is a set of M non-empty clusters whose pairwise intersections are empty and whose union is the whole data set.

Theorem: Let a data set with n elements be given. The partition π = {G1, G2} produced by the 2-Clusters algorithm is optimal.

Proof: Let L = {d1, d2, ..., dm} be the m elements of matrix D sorted in decreasing order. Let π be a partition produced by the 2-Clusters algorithm and Zπ be its largest diameter. Suppose that there is a partition π' with largest diameter Zπ' such that Zπ' < Zπ. Then the objects i and j that are responsible for the diameter of π, i.e., such that dij = Zπ, must be assigned to different clusters in π'. By construction, i and j were assigned to the same cluster in π, say, without loss of generality, to cluster G1, for the following reason: there is a dk in L such that dk = max{dir, djr : r ∈ G2} and dij = dk', with k < k'. If k' < k, by construction, i and j would have been assigned to different clusters at iteration k'. Therefore, the largest diameter of π' would be at least dk, which is greater than or equal to the Z of partition π, a contradiction.

Taking the complexity of the 2-Clusters algorithm into account, Step 2 is the critical one. In this step, we look for the pair of objects with the largest distance in D. Because D is a symmetric matrix, in the worst case it is necessary to inspect half of the elements of D. As a consequence, this procedure has complexity O(n²), where n is the order of D. A sequential search in D leads to an O(n⁴) algorithm. A binary heap improves the running time to O(n² log n), and an even better theoretical amortized bound of O(n²) may be reached with a Fibonacci heap.

In the following section, we present the four heuristic methods proposed to solve the MMD problem for finding partitions with more than two clusters.

4 PROPOSED METHODS

The heuristics proposed in this paper find a partition of a data set into M > 2 clusters to approximately solve the MMD problem, whose mathematical formulation was presented in Section 2. Four heuristics are proposed: a greedy constructive heuristic, which applies the 2-Clusters algorithm repeatedly to the largest-diameter clusters until the solution has the desired number of clusters; a greedy heuristic with local search, which starts with an infeasible solution, since the initial solution has more clusters than allowed, and, at each iteration, groups pairs of clusters and performs a local search; and two GRASP heuristics, which are based on the constructive heuristic and on the greedy heuristic with local search, performing the union of clusters in a randomized greedy fashion. Next, the proposed heuristics are detailed.

4.1 Constructive Heuristic (CH)

The constructive greedy heuristic proposed in this paper uses the 2-Clusters algorithm recursively. In the first step, the 2-Clusters algorithm is used to divide the data into two clusters. Next, the algorithm is applied again to the largest sized group and so on, until the M clusters are obtained.

4.2 Greedy Heuristic with Local Search (GHLS)

A local search heuristic looks for a better solution in the neighborhood of a given solution. When one is found, the current solution is replaced by the neighboring solution, and the process is repeated until the current solution has no better neighbor. Considering that the model used seeks to minimize the largest diameter among the M clusters of a partition, neighboring solutions with a lower maximum diameter value are sought.


The proposed local search was defined based on the neighborhood obtained using the following movement: remove the pair of elements that defines the largest diameter among all clusters and allocate them to other clusters so that the largest diameter is reduced. The local search is applied to a partition of the objects from the data set. The local search proposed is detailed in Algorithm 2.
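One improving move of this neighborhood can be sketched as follows (clusters are assumed to be represented as sets of object indices; the helper names are ours, not those of Algorithm 2):

```python
import itertools

def diameter(cluster, D):
    """Largest pairwise distance inside a cluster (set of indices)."""
    return max((D[i][j] for i, j in itertools.combinations(sorted(cluster), 2)),
               default=0.0)

def max_diameter(clusters, D):
    return max(diameter(c, D) for c in clusters)

def improve_once(clusters, D):
    """One local search move: take the pair realizing the largest diameter
    and try to relocate one of its two objects to another cluster so that
    the overall maximum diameter strictly decreases.
    Returns the improved clustering, or None when no improving move exists."""
    current = max_diameter(clusters, D)
    # cluster and pair realizing the largest diameter
    k = max(range(len(clusters)), key=lambda c: diameter(clusters[c], D))
    i, j = max(itertools.combinations(sorted(clusters[k]), 2),
               key=lambda p: D[p[0]][p[1]])
    for obj in (i, j):
        for c in range(len(clusters)):
            if c == k:
                continue
            trial = [set(s) for s in clusters]
            trial[k].discard(obj)
            trial[c].add(obj)
            if trial[k] and max_diameter(trial, D) < current:
                return trial
    return None
```

Repeating `improve_once` until it returns None yields a local optimum with respect to this neighborhood.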


The starting point of GHLS is a partition with ⌊K*M⌋ clusters generated by the previously described CH, where K is chosen experimentally as reported in Section 6. The value ⌊K*M⌋ denotes the greatest integer less than or equal to K*M. The two clusters whose union results in the smallest increase in the objective function are then selected and grouped; notice that this is a greedy strategy. As a result, we obtain ⌊K*M⌋ - 1 clusters. This process is repeated until we obtain a feasible solution, that is, a partition with M clusters. Thus, we have the Greedy Heuristic with Local Search described by Algorithm 3.
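The greedy union step above (choose the pair of clusters whose union yields the smallest diameter) can be sketched as follows, assuming clusters are sets of object indices (helper names are ours):

```python
import itertools

def diameter(cluster, D):
    """Largest pairwise distance inside a cluster (set of indices)."""
    return max((D[i][j] for i, j in itertools.combinations(sorted(cluster), 2)),
               default=0.0)

def greedy_union(clusters, D):
    """Merge the pair of clusters whose union has the smallest diameter,
    i.e. the union causing the smallest increase in the objective."""
    a, b = min(itertools.combinations(range(len(clusters)), 2),
               key=lambda ab: diameter(clusters[ab[0]] | clusters[ab[1]], D))
    merged = clusters[a] | clusters[b]
    return [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
```

Calling `greedy_union` repeatedly reduces a partition with ⌊K*M⌋ clusters down to the M clusters required.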


4.3 GRASP

GRASP (Feo & Resende, 1995) is a metaheuristic consisting of two phases: a constructive phase and a local search phase. The constructive phase consists in obtaining, iteratively, a pseudo-greedy solution. At each iteration of this phase, a set S of all possible choices of elements to be added to the current partial solution is evaluated. One of the best t options, with t < N, is drawn and added to the partial solution. This shortlist of candidates is called the restricted candidate list (RCL). At the end of this phase, a feasible solution to the problem is found, to which a local search procedure is applied.

The GRASP metaheuristic is known for finding good quality solutions in various optimization problems (Nascimento et al., 2010; Marinakis et al., 2008). For this reason, two GRASP metaheuristics for the MMD problem are proposed in this paper. The proposed constructive phase of GRASP is the GHLS with Step 4 modified. In this modification, instead of grouping the two clusters of a partition π whose union produces the lowest maximum diameter, two clusters that provide one of the t smallest maximum diameters are grouped. The pseudo code of the constructive phase of the proposed metaheuristics is detailed in Algorithm 4. Note that, in Step 4, the size of vector V is equal to the number of pairwise combinations of the m clusters.
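This randomized variant of the union step can be sketched in the same spirit (a sketch only; the RCL size t and the helper names are ours):

```python
import itertools
import random

def diameter(cluster, D):
    """Largest pairwise distance inside a cluster (set of indices)."""
    return max((D[i][j] for i, j in itertools.combinations(sorted(cluster), 2)),
               default=0.0)

def randomized_union(clusters, D, t, rng=random):
    """GRASP constructive step: rank every pairwise union by the diameter
    it would produce, keep the t best ones (the RCL) and draw one of them
    uniformly at random."""
    candidates = sorted(
        itertools.combinations(range(len(clusters)), 2),
        key=lambda ab: diameter(clusters[ab[0]] | clusters[ab[1]], D))
    a, b = rng.choice(candidates[:t])
    merged = clusters[a] | clusters[b]
    return [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
```

With t = 1 this reduces to the purely greedy union of GHLS; larger t values diversify the constructed solutions across GRASP iterations.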


In Step 3 of Algorithm 4, α changes at every iteration; thus, this algorithm can be considered a reactive GRASP (Resende & Ribeiro, 2010). In Algorithm 4, it is necessary to build a sorted vector with the values of the diameters resulting from the pairwise junction of all clusters of π. To construct the vector V, it is necessary to scan the distance matrix D and to sort V. Thus, the order of Algorithm 4 is O(n² + m*log(m)).

The two GRASP heuristics proposed in this paper (GRASP-I and GRASP-II) differ in that GRASP-II applies the local search to the partial solutions of the constructive phase, whereas GRASP-I does not. At the end of the constructive phase of the proposed metaheuristics, at each iteration, a feasible solution is obtained, which is improved by the local search procedure. Several iterations of these steps are carried out and the highest quality solution is kept. The proposed metaheuristics are detailed in the pseudo codes presented, respectively, in Algorithms 5 and 6.



In Algorithms 5 and 6, the function Zπ gives the value of the largest diameter among the clusters of partition π; the constants Max_it and INFINITY are, respectively, the maximum number of iterations of the proposed GRASP and a large initial value for the empty solution.

The algorithms GRASP-I and GRASP-II use the constructive heuristic to generate an initial partition with ⌊K*M⌋ clusters. In the best case, the order of this phase is O(⌊K*M⌋*n²) and, in the worst case, O(⌊K*M⌋*n⁴) (see Step 2). In each of the Max_it iterations of both metaheuristics, the constructive phase is repeated ⌊K*M⌋ - M times. The local search phase is performed once in each iteration of GRASP-I, whereas in GRASP-II it is repeated ⌊K*M⌋ - M times (the same number of times as the constructive phase). Denoting by m the number of clusters, in the best case the local search performs only one iteration, with order O(m*n²). In the worst case, although unlikely, all pairs of objects can be changed, giving order O(m*n⁴). In the best case, GRASP-I has order O(Max_it*((⌊K*M⌋ - M)*(n² + m*log(m)) + m*n²)) and, in the worst case, O(Max_it*((⌊K*M⌋ - M)*(n² + m*log(m)) + m*n⁴)). In the best case, GRASP-II has order O(Max_it*(⌊K*M⌋ - M)*(m*log(m) + m*n²)) and, in the worst case, O(Max_it*(⌊K*M⌋ - M)*(m*log(m) + m*n⁴)). Therefore, if we consider only the number of objects (n), the proposed GRASP heuristics have the same order of complexity as the 2-Clusters algorithm presented in Section 3.

5 COMPLETE LINKAGE

Complete Linkage is an agglomerative hierarchical method for data clustering (Mingoti, 2007). An agglomerative hierarchical method initially considers each element of a data set with n objects as a cluster with a single object (an initial clustering with n isolated clusters). At each iteration of the algorithm, two clusters are chosen, according to a measure of similarity (or dissimilarity), to be joined, thus forming a clustering with one cluster fewer. The process is repeated until a single cluster containing all the objects is obtained, therefore, in n iterations.

The hierarchical property consists in the fact that, once a pair of elements appears in the same cluster at some iteration of the algorithm, these elements are kept together in all subsequent iterations. This property allows the construction of a tree (dendrogram) recording the unions that occurred in the iterations of the algorithm. Thus, a partition with M clusters (M > 1) can be obtained by cutting the tree at the (n - M + 1)-th iteration.

In Complete Linkage, the dissimilarity between two clusters G1 and G2 is defined as the largest distance between an object of G1 and an object of G2. In other words, the dissimilarity between two clusters is the diameter of the cluster resulting from their union. Therefore, Hansen & Delattre (1978) suggest that Complete Linkage can be considered a heuristic for the MMD problem. Its pseudo code, with the tree cut for M clusters, is shown in Algorithm 7.
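Under the assumption that clusters are kept as sets of object indices, the agglomerative scheme with the complete-linkage merge criterion can be sketched as:

```python
import itertools

def complete_linkage(D, M):
    """Naive agglomerative Complete Linkage: start from singletons and
    repeatedly merge the two clusters whose largest between-cluster
    distance is smallest, until M clusters remain."""
    clusters = [{i} for i in range(len(D))]
    while len(clusters) > M:
        a, b = min(itertools.combinations(range(len(clusters)), 2),
                   key=lambda ab: max(D[i][j]
                                      for i in clusters[ab[0]]
                                      for j in clusters[ab[1]]))
        merged = clusters[a] | clusters[b]
        clusters = [c for k, c in enumerate(clusters)
                    if k not in (a, b)] + [merged]
    return clusters
```

This naive version recomputes every between-cluster distance at each merge; production implementations (such as the one in R's cluster package used in Section 6) cache and update these distances instead.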


6 COMPUTATIONAL TESTS

The proposed heuristics were programmed in C and the computational experiments were performed on a Core 2 Duo 2.0 GHz microcomputer with 3 GB of RAM under the Windows 7 operating system. The tests using the software CPLEX version 12.2 were carried out on an IBM quad-core Intel(R) Xeon(R) CPU E5504 2.00 GHz cluster under Linux, with a processor with 2 GB of RAM. The values used for the parameter K of the pseudo codes GRASP-I and GRASP-II were 1.5, 2 and 3. Other values of K were tried in preliminary tests and, overall, the results obtained with K < 1.5 showed lower quality than those obtained when K > 1.5. For K > 3, the results did not show improvement significant enough to justify the considerable increase in computational time. The best results were obtained for K equal to 2; therefore, this value was adopted for the three proposed heuristics: GRASP-I, GRASP-II and GHLS. The maximum number of iterations of GRASP-I and GRASP-II, Max_it, was set to 100, a value adjusted by means of computational experiments that considered the trade-off between solution quality and the computational time needed to reach the final solution.

For the computational tests, we used 60 graphs with the number of nodes ranging from 100 to 1000 and containing structures of 3 to 50 clusters. Here, we refer to the nodes of the graphs as objects when applying data clustering algorithms. These artificial graphs were generated by Nascimento (2010) using the following scheme: let A = {100, 200, 300, ..., 1000} be the set of numbers of nodes and B = {3, 4, 5, 10, 20, 50} be the set of numbers of clusters. For each x ∈ A and each y ∈ B, there is a single graph with x nodes and a structure of y clusters. An example of a graph with 100 nodes and a structure of 3 clusters is presented in Figure 2.


As the original weights between two nodes measure the similarity between them, a conversion was necessary to obtain the distance between them. For such, the following conversion formula was used:

dij is defined by this conversion when wij ≠ 0, and dij = INFINITY when wij = 0. After this step, to obtain the distance matrix between all nodes, the shortest path between each pair of nodes was computed, which can be done using Dijkstra's algorithm. With these distance matrices (referred to here as data sets), the experiments were carried out.

The results of the heuristics were evaluated according to two criteria: 1) the quality of the solutions obtained according to the objective function value; and 2) the adequacy of the solutions according to the NMI (Danon et al., 2005), which is an external evaluation criterion for data clustering. In the first case, to assess the solution quality of the proposed methods, their results were compared with each other and with Complete Linkage (Hansen & Delattre, 1978), described in Section 5. The implementation used for this algorithm is available in the cluster package of the R-project software (Ihaka, 1993). In addition, the results were compared with the optimal solutions of the problems; thus, the problems were solved using the CPLEX 12.2 optimization software (IBM ILOG, 2010). Also, to evaluate the suitability of our algorithm, we considered a classical algorithm from the literature, C-Means (Bezdek, 1981), whose implementation is from the e1071 package of the R-project. Roughly speaking, the Fuzzy C-Means, or simply C-Means, algorithm was proposed by Bezdek (1981). It is based on fuzzy logic, in which each object belongs to each cluster with a certain degree of membership, so that one object can belong to more than one cluster. The basic idea of the algorithm is to start with random cluster centers and iteratively update them according to an objective function. The C-Means algorithm searches for a partition that minimizes this objective function, which is the average of the squared distances of each object to the centers of all clusters, weighted by the degree of membership of each object with respect to each cluster.
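The shortest-path distances mentioned above can be obtained, for instance, with a standard heap-based Dijkstra run from each node (a generic sketch; the adjacency-list representation is ours):

```python
import heapq

def dijkstra(adj, src):
    """Single-source shortest paths; adj maps each node to a list of
    (neighbor, distance) pairs with non-negative distances."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float('inf')):
            continue            # stale heap entry, already settled cheaper
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def distance_matrix(adj, nodes):
    """Full matrix of pairwise shortest-path distances; unreachable
    pairs get float('inf'), matching the INFINITY convention above."""
    return [[dijkstra(adj, s).get(t, float('inf')) for t in nodes]
            for s in nodes]
```

Running Dijkstra from every node costs O(n (e + n log n)) with e edges, which is adequate for the graph sizes (up to 1000 nodes) used in these experiments.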

The computational tests comprised three experiments. In the first experiment, the four proposed methods (CH, GHLS, GRASP-I and GRASP-II) were applied to the described data sets, and their solutions were compared with each other to verify their performance and identify the best proposal. In the second experiment, the solutions obtained by the best method were compared with those obtained by the Complete Linkage algorithm. In addition, both methods were compared with the solutions obtained by the CPLEX optimization software with the computational time limited to 3 hours. Finally, in the third experiment, the results obtained by the best proposed method, CPLEX, Complete Linkage and C-Means were evaluated regarding their classification suitability according to the NMI.

In the first two experiments, the gap was used as the main comparative measure, computed as

gap = 100% × (heur_sol - best_heur) / best_heur,

where heur_sol is the value of the heuristic solution and best_heur is the value of the best heuristic solution.

6.1 Comparison between the proposed methods

Figure 3 illustrates the results obtained with the proposed heuristics, considering the objective function, that is, the value of the variable Z of the problem. In these graphs, one can observe the superiority of the solutions found by GRASP-II over the other heuristics. For 59 of the 60 data sets, GRASP-II found results better than or equal to the solutions of the other heuristics. GRASP-I and GHLS found results better than or equal to the solutions of the other heuristics in 25 and 15 instances, respectively. For only one of the instances, CH had the best result and, in 7 data sets, its results were equal to the best solution found.


Regarding computational time, as CH and GHLS are simplifications of GRASP-I, both required less computational time than GRASP-I; in the worst case, they took about six minutes to run. GRASP-II is computationally more expensive than GRASP-I because it applies the local search every time two clusters are grouped in the constructive phase. In addition, in GRASP-II, the number of cluster unions in each iteration is proportional to the number of clusters of the final partition (M). This means that the computational cost increases with the number of clusters; therefore, in problems with many clusters, the computational cost of GRASP-II is expected to be high. The highest computational cost of GRASP-II occurred for the instance with 1000 objects and 50 clusters, for which the method required about 25 minutes to complete the run. For further analysis, Table 1 in the appendix of this paper reports the objective function values of the solutions found by the heuristics and their computational times.

As GRASP-II generated the best results for most of the assessed problems, the next experiments were performed only for this metaheuristic.

6.2 Comparison of GRASP-II, Complete Linkage and CPLEX

In order to assess the quality of the solutions obtained by GRASP-II, we compared them with the solutions obtained by the benchmark heuristic from the literature, Complete Linkage (Hansen & Delattre, 1978). The results of this comparison can be observed in Figure 4, whose graphs show the gaps obtained by the Complete Linkage and GRASP-II heuristics with respect to the best feasible solution found among the three methods: Complete Linkage, GRASP-II and CPLEX. GRASP-II had a mean gap of 0.3%, a standard deviation of 1.1% and a maximum gap of 5.6%. Complete Linkage obtained a mean gap of 8.4%, a standard deviation of 11.3% and a maximum gap of 44.0%. In addition, for 7 of the 60 assessed problems, Complete Linkage and GRASP-II obtained the same results. For only 9 instances were the Complete Linkage results better than those of GRASP-II.


Analyzing the results obtained by CPLEX, only 12 optimal solutions were found. For these instances, GRASP-II also found the optimal solutions, whereas Complete Linkage obtained the optimal solution for four of them. Regarding the other instances, for only 3 of them did GRASP-II obtain worse solutions than CPLEX, whereas Complete Linkage obtained 16 inferior results. For more details about these solutions, we recommend analyzing the values reported in Table 2 of the appendix of this paper, which presents the results obtained by the heuristics, the bounds obtained by CPLEX and the execution time of each tested method.

According to these results, we can conclude that GRASP-II had an excellent performance, despite the higher computational time in some test cases. Thus, one can consider this metaheuristic a better alternative than Complete Linkage, a classical heuristic from the literature.

Next, another criterion is evaluated with respect to the results of Complete Linkage, CPLEX and the best heuristic proposed in this paper, GRASP-II. For this assessment, we used an external evaluation criterion, consistent with the classification of the data sets.

6.3 Assessment of the solutions according to NMI

To evaluate the results of GRASP-II with respect to the real classification of the data sets used, an external evaluation criterion for data clustering was adopted, the Normalized Mutual Information (NMI) (Danon et al., 2005). The use of this measure is inspired by the study performed by Lancichinetti & Fortunato (2009), in which the authors compare partitions of data sets found by clustering algorithms taking the known labels of the data sets into account. For such, first consider a partition as a vector containing the labels (the cluster numbers) at the positions of the corresponding objects. For example, if the i-th object belongs to the k-th cluster, then the i-th position of its label vector is k. Having this definition, let π(1) and π(2) be partitions whose label vectors are represented, respectively, as p(1) and p(2). The NMI, the normalized form of the Mutual Information measure that estimates the dependence between two random variables, is presented in Equation (5):

NMI(X, Y) = 2 I(X, Y) / (H(X) + H(Y))     (5)

where X and Y are the random variables that describe p(1) and p(2), respectively,

I(X, Y) = Σ_x Σ_y P(x, y) log [ P(x, y) / (P(x) P(y)) ]

and

H(X) = −Σ_x P(x) log P(x),  H(Y) = −Σ_y P(y) log P(y).

The closer the NMI is to 1, the better the clustering agrees with the original labels. There are other versions for the normalization of the NMI; however, the one presented here is better suited for comparing partitions that are not guaranteed to have balanced clusters, which is the case of the data sets in our experiments. The graphs in Figure 5 show the results achieved by the GRASP-II, Complete Linkage, C-Means and CPLEX methods according to NMI.
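Equation (5) can be computed directly from the two label vectors. The sketch below is a pure-Python illustration (not the authors' code), using empirical frequencies and natural logarithms:

```python
import math
from collections import Counter

def nmi(p1, p2):
    """Normalized Mutual Information between two label vectors,
    normalized by the sum of the two entropies (Danon et al., 2005)."""
    n = len(p1)
    cx, cy = Counter(p1), Counter(p2)       # marginal label counts
    cxy = Counter(zip(p1, p2))              # joint label counts
    hx = -sum(c / n * math.log(c / n) for c in cx.values())
    hy = -sum(c / n * math.log(c / n) for c in cy.values())
    # mutual information I(X, Y) from the empirical distributions
    ixy = sum(c / n * math.log((c / n) / (cx[x] / n * cy[y] / n))
              for (x, y), c in cxy.items())
    if hx + hy == 0:
        return 1.0  # both partitions are trivial (a single cluster)
    return 2 * ixy / (hx + hy)
```

Identical partitions (up to a relabeling of clusters) give NMI = 1, and statistically independent partitions give NMI = 0.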


Complete Linkage obtained lower results than the other methods for only 14 instances: GRASP-II, C-Means and CPLEX obtained the best solutions for 9, 4 and 1 of them, respectively. Even for the 12 instances for which CPLEX and GRASP-II obtained optimal solutions, Complete Linkage still had better mean NMI values. The mean NMI of GRASP-II and of C-Means were 0.7 and 0.6, respectively, both with standard deviation of 0.2, whereas for Complete Linkage the mean NMI was 0.9, with standard deviation of 0.1. CPLEX showed the worst results in this experiment, with a mean NMI of 0.3 and a standard deviation of 0.3. It should be noticed that Complete Linkage achieved better results than GRASP-II in 83% of the problems, and better results than C-Means and CPLEX in 92% and 95% of them, respectively. Even though GRASP-II had worse results than Complete Linkage, its NMI was higher than 0.7 for 58% of the instances. These data can be found in Table 3 of the appendix of this paper.

To sum up, it must be highlighted that GRASP-II obtained the best objective function values among all the tested methods; in other words, this metaheuristic found the best solutions for the studied integer problem. Furthermore, it is possible that finding partitions with different numbers of clusters for the same data set would improve the NMI results of the proposed heuristic and, possibly, lead the heuristics to more robust solutions regarding the external evaluation criterion, NMI.

7 FINAL REMARKS

In this paper, we studied the data clustering problem based on a formulation that aims at minimizing the largest diameter of a partition. To solve it approximately, four heuristics were proposed and assessed: two greedy heuristics (CH and GHLS) and two GRASP metaheuristics (GRASP-I and GRASP-II). In the first experiment, the performance of the methods was comparatively evaluated. The results showed that the solutions obtained by GRASP-II were superior to those of the other proposed methods; however, as expected, this method was computationally the most expensive. In the second experiment, the main purpose was to compare the objective function value obtained by GRASP-II with the solutions of the Complete Linkage method and with the optimal solutions found by the optimization software CPLEX. GRASP-II proved to be very efficient, obtaining better results than Complete Linkage in 85% of the cases. Moreover, it found the optimal solutions for the twelve problems for which CPLEX provided them. In the third and last experiment, we performed a suitability test using an external validation criterion for data clustering, the Normalized Mutual Information (NMI). In this experiment, the partitions found by GRASP-II, CPLEX, Complete Linkage and C-Means were validated against the original partition of each data set. In this case, it was observed that, for the studied data sets, Complete Linkage obtained solutions closer to the real classification than GRASP-II and C-Means.

The results of this study indicate that, for a considerably large number of instances, the MMD problem can be solved efficiently by GRASP-II. Moreover, in the first two experiments, GRASP-II achieved the best results in this paper. The improvement we aimed at concerns the internal validation criterion, that is, the objective function of the MMD problem; in this respect, the proposed metaheuristic proved highly efficient. Nevertheless, even though the results obtained by the proposed metaheuristic according to the external validation criterion had good quality, they were lower than those of Complete Linkage, and closer to the results obtained by C-Means and by CPLEX in the problems for which the latter found a feasible solution. These results may indicate that the MMD problem is not the most appropriate formulation for the data clustering problem.

ACKNOWLEDGMENTS

The authors would like to thank the anonymous referees for their helpful comments which significantly improved the quality of this paper. This research was funded by the Brazilian FAPESP (Fundação de Amparo à Pesquisa do Estado de São Paulo) and CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico).

APPENDIX A: NUMERICAL RESULTS

In this Appendix, we present the numerical results obtained in the experiments. The values reported in Tables 1, 2 and 3 correspond to the data from the graphs displayed in Figures 3, 4 and 5.

Table 1 presents the results obtained by the four algorithms proposed in this paper. The first column shows the name of each data set, which indicates its number of objects and its number of clusters; for example, the data set 100_3 has 100 objects and 3 clusters. The second to fifth columns present the results of the heuristics, that is, their execution time in seconds (Time) and objective function value (Z). The best results are highlighted in bold.

Table 2 presents the results of GRASP-II, Complete Linkage and CPLEX. The GRASP-II and Complete Linkage columns exhibit the computational time in seconds (Time), the objective function value (Z) and the gap of each solution with respect to the best solution obtained. The best objective function values and gaps are highlighted in bold. The CPLEX column shows the best lower bound found (LB), the value of the best integer solution, i.e., the upper bound (UB), and the computational time in seconds (Time). When LB coincides with UB, the solution obtained is optimal; in this case, these results are highlighted in bold. A time limit of 3 hours (10,800 seconds) was imposed on CPLEX.

Table 3 presents the values of NMI achieved by the solutions found by GRASP-II, Complete Linkage, C-Means and CPLEX. Again, the best results are marked in bold.

  • [1] BEZDEK JC. 1981. Pattern recognition with fuzzy objective function algorithms. New York: Plenum.
  • [2] BOGINSKI V, BUTENKO B & PARDALOS PM. 2006. Mining market data: A network approach. Computers & Operations Research, 33: 3171-3184.
  • [3] BRUSCO MJ & STAHL S. 2005. Branch and Bound applications in combinatorial data analysis. New York: Springer-Verlag.
  • [4] CANO J, CORDÓN O, HERRERA F & SÁNCHEZ L. 2002. A GRASP algorithm for clustering. Lecture Notes in Computer Science, Springer, 214-223.
  • [5] DANON L, DUCH J, ARENAS A & DÍAZ-GUILERA A. 2005. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, P09008.
  • [6] DUDA RO. 2001. Pattern classification. John Wiley & Sons.
  • [7] FEO TA & RESENDE MGC. 1995. Greedy randomized adaptive search procedures. Global Optimization, 6: 109-133.
  • [8] HANSEN P & DELATTRE M. 1978. Complete-link cluster analysis by graph coloring. Journal of the American Statistical Association, 73(362): 397-403.
  • [9] HANSEN P & JAUMARD B. 1997. Cluster analysis and mathematical programming. Mathematical Programming, 79: 191-215.
  • [10] HANSEN P & MLADENOVIC N. 2001. J-Means: a new local search heuristic for minimum sum of squares clustering. Pattern Recognition, 34: 405-413.
  • [11] HIGHAM DJ, KALNA G & VASS JK. 2007. Spectral analysis of two-signed microarray expression data. Mathematical Medicine and Biology, 24: 131-148.
  • [12] HUTTENHOWER C, FLAMHOLZ AI, LANDIS JN, SAHI S, MYERS CL, OLSZEWSKI KL, HIBBS MA, SIEMERS NO, TROYANSKAYA OG & COLLER HA. 2007. Nearest neighbor networks: clustering expression data based on gene neighborhoods. BMC Bioinformatics, 8: 250.
  • [13] IBM ILOG. 2010. CPLEX 12.2.0 reference manual.
  • [14] IHAKA R & GENTLEMAN R. 1993. R Development Core Team. GNU General Public License.
  • [15] JAIN AK & DUBES RC. 1988. Algorithms for Clustering Data. Prentice-Hall Advanced Reference Series. Prentice-Hall, Inc., Upper Saddle River, NJ.
  • [16] JAIN AK, MURTY MN & FLYNN PJ. 1999. Data clustering: a review. ACM Computing Surveys, 31: 264-323.
  • [17] KAWAJI H, TAKENAKA Y & MATSUDA H. 2004. Graph-based clustering for finding distant relationships in a large set of protein sequences. Bioinformatics, 20(2): 243-252.
  • [18] KRAUSE A, STOYE J & VINGRON M. 2005. Large scale hierarchical clustering of protein sequences. BMC Bioinformatics, 6: 15.
  • [19] LANCICHINETTI A & FORTUNATO S. 2009. Community detection algorithms: a comparative analysis. Physical Review E, 80: 056117.
  • [20] MARINAKIS Y, MARINAKI M, DOUMPOS M, MATSATSINIS N & ZOPOUNIDIS C. 2008. A hybrid stochastic genetic-GRASP algorithm for clustering analysis. Operational Research, 8: 22-46.
  • [21] MINGOTI SA. 2007. Análise de dados através de métodos de estatística multivariada, uma abordagem aplicada. Belo Horizonte: UFMG.
  • [22] NASCIMENTO MCV. 2010. Metaheurísticas para o problema de agrupamento de dados em grafo. Tese de Doutorado, Universidade de São Paulo.
  • [23] NASCIMENTO MCV, TOLEDO FMB & CARVALHO ACPLF. 2010. Investigation of a new GRASP-based clustering algorithm applied to biological data. Computers & Operations Research, 37: 1381-1388.
  • [24] RAO MR. 1971. Cluster analysis and mathematical programming. Journal of the American Statistical Association, 66: 622-626.
  • [25] RESENDE MGC & RIBEIRO CC. 2010. Greedy randomized adaptive search procedures: Advances, hybridizations, and applications. In: M. Gendreau and J.-Y. Potvin, editors, Handbook of Metaheuristics. Springer-Verlag, 2nd edition.
  • [26] ROMANOWSKI CJ, NAGI R & SUDIT M. 2006. Data mining in an engineering design environment: or applications from graph matching. Computers & OR, 33: 3150-3160.
  • [27] ROMESBURG HC. 2004. Cluster Analysis for Researchers. Lulu Press (reprint).
  • [28] WU Z & LEAHY R. 1993. An optimal graph theoretic approach to data clustering: Theory and its application to image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15: 1101-1113.
  • Publication Dates

    • Publication in this collection
      30 Nov 2012
    • Date of issue
      Dec 2012
    Sociedade Brasileira de Pesquisa Operacional Rua Mayrink Veiga, 32 - sala 601 - Centro, 20090-050 Rio de Janeiro RJ - Brasil, Tel.: +55 21 2263-0499, Fax: +55 21 2263-0501 - Rio de Janeiro - RJ - Brazil
    E-mail: sobrapo@sobrapo.org.br