A cophenetic correlation coefficient for Tocher’s method

The objective of this work was to propose a way of using Tocher’s clustering method to obtain a matrix analogous to the cophenetic matrix obtained for hierarchical methods, which would allow the calculation of a cophenetic correlation. To illustrate how the proposed cophenetic matrix is obtained, we used two dissimilarity matrices between 17 garlic cultivars, based on six morphological characters: one obtained with the generalized squared Mahalanobis distance and the other with the Euclidean distance. The proposal is to build the cophenetic matrix from the average distances within and between clusters, after performing the clustering. A function in the R language was proposed to compute the cophenetic matrix for Tocher’s method. The empirical distribution of this correlation coefficient was briefly studied. For both dissimilarity measures, the cophenetic correlations obtained for Tocher’s method were higher than those obtained with the hierarchical methods (Ward’s algorithm and average linkage, UPGMA). Clusterings made with agglomerative hierarchical methods and with Tocher’s method can therefore be compared using a common criterion: the correlation between the matrices of original and cophenetic distances.


Introduction
Tocher's optimization method (Rao, 1952) establishes mutually exclusive clusters of objects according to an objective function whose optimization criterion minimizes the average intra-cluster distance and maximizes the average inter-cluster distance. This method has been used in studies quantifying the genetic variability between individuals, both in plants (Gouvêa et al., 2010; Rajamanickam & Rajmohan, 2010; Gorji & Zolnoori, 2011; Leão et al., 2011; Matsuo et al., 2012) and in animals (Barbosa et al., 2005). Descriptions of the clustering process with Tocher's method can be found in Sharma (2006) and Cruz et al. (2011).
In clustering studies, it is advisable to evaluate consistency, so that conclusions about similarities between individuals are reliable. In clustering with hierarchical algorithms, the correlation between the elements of the original dissimilarity matrix and the corresponding elements of the matrix produced by the phenogram (the cophenetic matrix) is taken as a measure of clustering consistency. This measure, known as the cophenetic correlation coefficient, was proposed by Sokal & Rohlf (1962) and is available in most statistical computer packages. Since then, clustering results have been compared using the cophenetic correlation (Kopp et al., 2007; Gonçalves et al., 2008; Cargnelutti Filho et al., 2010; Cargnelutti Filho & Guadagnin, 2011). This is possible because the construction of a phenogram allows a cophenetic matrix to be calculated.
However, Tocher's method does not involve the construction of a phenogram to perform the clustering. Thus, clustering consistency has been evaluated indirectly, based on the results of hierarchical clustering and other multivariate methods (Bertan et al., 2006; Leal et al., 2008; Silva, 2012), including ordination techniques, which sometimes become impractical due to the excessive number of variables and objects. Moreover, some multivariate methods, such as discriminant analysis, require at least that the classification variables be numerical, whereas evaluation by cophenetic correlation needs only the clustering result.
Therefore, this work follows the premise of Sneath & Sokal (1973) that cophenetic values can be obtained even by ordination methods. The objective of this work was to propose a way of using Tocher's clustering method to obtain a matrix analogous to the cophenetic matrix obtained for hierarchical methods, which would allow the calculation of a cophenetic correlation.

Materials and Methods
Tocher's method operates on a dissimilarity (or similarity) matrix between individuals. To illustrate how the proposed cophenetic matrix is obtained, two dissimilarity matrices (Table 1) were used: the first was obtained with the generalized squared Mahalanobis distance (D²) and the other with the Euclidean distance, both between 17 garlic cultivars, based on six morphological characters, extracted from Silva (2012).
In hierarchical methods, the cophenetic matrix consists of the cophenetic distances, i.e., the fusion levels at which entities are joined. For Tocher's method, the proposal is to obtain the cophenetic matrix from the average distances within and between clusters.
The average distance within the k-th cluster is obtained by averaging the distances between pairs of individuals within the cluster, according to the following expression:

$$\bar{d}_{k} = \frac{2}{n_k (n_k - 1)} \sum_{i<j} d_{i,j}$$

in which: $n_k$ is the number of individuals in the k-th cluster; and $d_{i,j}$ is the distance between the individuals i and j allocated in the k-th cluster. Obviously, this expression requires $n_k \geq 2$, since a cluster with a single individual contains no pairs.

The average distance between the k-th and the k'-th cluster is obtained by averaging the distances between crossed pairs of individuals from the two clusters involved, according to the following equation:

$$\bar{d}_{k,k'} = \frac{1}{n_k \, n_{k'}} \sum_{i=1}^{n_k} \sum_{j=1}^{n_{k'}} d_{i,j}$$

in which: $n_k$ and $n_{k'}$ are, respectively, the number of individuals in the k-th and k'-th clusters; and $d_{i,j}$ is the distance between the i-th individual from the k-th cluster and the j-th individual from the k'-th cluster.
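The two averages can be illustrated with a minimal Python sketch. It is not the R function proposed in this work; the function names, the toy distance matrix, and the choice of returning zero for a single-individual cluster are all assumptions made for illustration.

```python
from itertools import combinations


def average_within(dist, members):
    """Mean distance over all unordered pairs of individuals in one cluster."""
    pairs = list(combinations(members, 2))
    if not pairs:
        # Assumption: a single-individual cluster has no pairs; 0.0 is used
        # here as a placeholder convention, not a value taken from the paper.
        return 0.0
    return sum(dist[i][j] for i, j in pairs) / len(pairs)


def average_between(dist, members_k, members_l):
    """Mean distance over all crossed pairs of individuals from two clusters."""
    total = sum(dist[i][j] for i in members_k for j in members_l)
    return total / (len(members_k) * len(members_l))


# Toy symmetric distance matrix for four objects (made-up values).
dist = [
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 3.0, 6.0],
    [4.0, 3.0, 0.0, 2.0],
    [5.0, 6.0, 2.0, 0.0],
]
clusters = [[0, 1], [2, 3]]  # hypothetical clustering result

within = [average_within(dist, c) for c in clusters]
between = average_between(dist, clusters[0], clusters[1])
print(within, between)
```

In this toy case each cluster holds one pair, so the within-cluster averages are simply the two pair distances, and the between-cluster average is the mean of the four crossed distances.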
Letting g be the number of clusters formed by Tocher's method, the number of distances actually involved in the construction of the cophenetic matrix is a function only of the number of clusters formed, namely g(g + 1)/2. This implies that the calculations used to obtain that matrix extend directly to the modified Tocher's method proposed by Vasconcelos et al. (2007). The construction of this matrix therefore depends directly on the number of clusters formed by the method.
For the example used as illustration, diagrams of clusters were designed to represent the average distance relationships within and between clusters (Figure 1).
Matrices containing the average distances within clusters on the main diagonal and average distances between clusters off-diagonal were constructed to facilitate obtaining the cophenetic matrix.
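The expansion from the g x g matrix of average distances to the full cophenetic matrix can be sketched in Python as follows. This is a hedged illustration, not the R function proposed in this work: the name `tocher_cophenetic`, its argument names, and the toy values are hypothetical, although the three inputs mirror those described for the R function (the matrix of averages, the individuals ordered per cluster, and the cluster sizes).

```python
def tocher_cophenetic(avg, id_cluster, sizes):
    """Expand a g x g matrix of average distances into an n x n matrix.

    avg        -- g x g symmetric matrix: within-cluster averages on the
                  main diagonal, between-cluster averages off the diagonal
    id_cluster -- object labels (1..n) listed cluster by cluster
    sizes      -- number of objects in each cluster, in the same order
    """
    n = len(id_cluster)
    if sum(sizes) != n or sorted(id_cluster) != list(range(1, n + 1)):
        raise ValueError("id_cluster must be a permutation of 1..n matching sizes")

    # Map each object label to the index of the cluster it belongs to.
    cluster_of = {}
    position = 0
    for k, size in enumerate(sizes):
        for obj in id_cluster[position:position + size]:
            cluster_of[obj] = k
        position += size

    # Every pair of distinct objects receives the average distance of the
    # cluster (same cluster) or pair of clusters (different clusters) involved.
    coph = [[0.0] * n for _ in range(n)]
    for a in range(1, n + 1):
        for b in range(1, n + 1):
            if a != b:
                coph[a - 1][b - 1] = avg[cluster_of[a]][cluster_of[b]]
    return coph


# Toy example: two clusters, {3, 1} and {4, 2}, with made-up averages.
avg = [[1.0, 4.5],
       [4.5, 2.0]]
coph = tocher_cophenetic(avg, id_cluster=[3, 1, 4, 2], sizes=[2, 2])
for row in coph:
    print(row)
```

The diagonal of the result is left at zero, since an object has no distance to itself; only the off-diagonal entries carry the cluster averages.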
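After constructing the cophenetic matrices, the correlations between the elements of each matrix of original distances and the respective elements of the cophenetic matrix were calculated, according to the expression:

$$r = \frac{\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} (c_{ij} - \bar{c})(d_{ij} - \bar{d})}{\sqrt{\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} (c_{ij} - \bar{c})^2} \; \sqrt{\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} (d_{ij} - \bar{d})^2}}$$

in which: $c_{ij}$ and $d_{ij}$ are, respectively, the elements of the i-th row and j-th column of the cophenetic and original distance matrices; $\bar{c}$ and $\bar{d}$ are the means of the $n(n-1)/2$ elements above the main diagonal of each matrix; and n is the number of individuals (n = 17 in this case).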
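As a minimal sketch of this calculation, the Python code below computes the Pearson correlation between the upper triangles of two matrices. The function names and the toy matrices are assumptions for illustration; the cophenetic matrix used here corresponds to a hypothetical clustering of four objects into {1, 2} and {3, 4}.

```python
from math import sqrt


def upper_triangle(matrix):
    """Return the elements above the main diagonal, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def cophenetic_correlation(coph, dist):
    """Pearson correlation between cophenetic and original distances."""
    c = upper_triangle(coph)
    d = upper_triangle(dist)
    mean_c = sum(c) / len(c)
    mean_d = sum(d) / len(d)
    sxy = sum((x - mean_c) * (y - mean_d) for x, y in zip(c, d))
    sxx = sum((x - mean_c) ** 2 for x in c)
    syy = sum((y - mean_d) ** 2 for y in d)
    return sxy / sqrt(sxx * syy)


# Toy original distances (made-up values).
dist = [
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 3.0, 6.0],
    [4.0, 3.0, 0.0, 2.0],
    [5.0, 6.0, 2.0, 0.0],
]
# Cophenetic matrix built from within-cluster averages 1.0 and 2.0 and a
# between-cluster average of 4.5.
coph = [
    [0.0, 1.0, 4.5, 4.5],
    [1.0, 0.0, 4.5, 4.5],
    [4.5, 4.5, 0.0, 2.0],
    [4.5, 4.5, 2.0, 0.0],
]
r = cophenetic_correlation(coph, dist)
print(round(r, 4))
```

Only the upper triangle is used because both matrices are symmetric with a zero diagonal, so including the remaining entries would duplicate pairs.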
Mantel's randomization test was applied, based on ten thousand permutations of the rows and columns of the cophenetic matrix, in order to test the hypothesis of null correlation between the cophenetic matrix and the original distance matrix, and also to allow the visualization of the empirical distribution of this correlation coefficient.
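The permutation scheme can be sketched in Python as follows. This is an illustration only, not the ade4 implementation used in this work: the function names, the fixed seed, the toy matrices, and the one-sided p-value convention (counting the observed value among the permutations) are assumptions.

```python
import random
from math import sqrt


def upper_triangle(matrix):
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mantel_test(coph, dist, n_perm=10000, seed=1):
    """Permute rows and columns of coph jointly and recompute the correlation."""
    rng = random.Random(seed)
    n = len(dist)
    d = upper_triangle(dist)
    observed = pearson(upper_triangle(coph), d)
    order = list(range(n))
    simulated = []
    for _ in range(n_perm):
        rng.shuffle(order)
        # The same permutation is applied to rows and columns, which keeps
        # the matrix symmetric and only relabels the objects.
        permuted = [[coph[order[i]][order[j]] for j in range(n)] for i in range(n)]
        simulated.append(pearson(upper_triangle(permuted), d))
    # Assumed convention: one-sided p-value with the observed value included.
    p_value = (sum(r >= observed for r in simulated) + 1) / (n_perm + 1)
    return observed, simulated, p_value


# Toy matrices (made-up values) for four objects.
dist = [
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 3.0, 6.0],
    [4.0, 3.0, 0.0, 2.0],
    [5.0, 6.0, 2.0, 0.0],
]
coph = [
    [0.0, 1.0, 4.5, 4.5],
    [1.0, 0.0, 4.5, 4.5],
    [4.5, 4.5, 0.0, 2.0],
    [4.5, 4.5, 2.0, 0.0],
]
observed, simulated, p_value = mantel_test(coph, dist, n_perm=999)
print(round(observed, 4), p_value)
```

The list `simulated` is the empirical distribution of the coefficient under the null hypothesis; with only four objects there are just 24 distinct relabelings, so a realistic analysis would need many more individuals.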
To compare results, clusterings were also performed using two hierarchical methods: Ward's algorithm and average linkage (UPGMA). The cophenetic correlations for these methods were calculated as well, and Mantel's test was applied to them.
The distance matrices used in this work and the Tocher's clustering were obtained with the multivariate analysis module of Genes software, version 2009.7.0 (Cruz, 2006). After the distance matrices were calculated, the hierarchical methods were applied with the hclust() function from the "stats" package of R, and the cophenetic matrices for these methods were obtained with the cophenetic() function, also from the "stats" package. Mantel's test was performed with the mantel.rtest() function from the "ade4" package (Dray & Dufour, 2007). All packages were used under version 2.15.2 of R (R Core Team, R Foundation for Statistical Computing, Vienna, AT).
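For comparison, the way a hierarchical method yields cophenetic distances can be sketched with a small average-linkage (UPGMA) routine in Python. This is not the hclust() or cophenetic() code used in this work; the function name, the tie-breaking rule, and the toy matrix are assumptions made for illustration.

```python
def upgma_cophenetic(dist):
    """Average-linkage clustering; returns the matrix of fusion levels."""
    n = len(dist)
    clusters = {i: [i] for i in range(n)}  # cluster id -> member objects
    coph = [[0.0] * n for _ in range(n)]
    next_id = n

    def average(a, b):
        # UPGMA distance: mean of all original distances between members.
        total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
        return total / (len(clusters[a]) * len(clusters[b]))

    while len(clusters) > 1:
        ids = sorted(clusters)
        # Closest pair of clusters; ties are broken by the lowest ids.
        height, a, b = min(
            (average(a, b), a, b)
            for pos, a in enumerate(ids)
            for b in ids[pos + 1:]
        )
        # Objects joined at this step share the fusion level as their
        # cophenetic distance.
        for i in clusters[a]:
            for j in clusters[b]:
                coph[i][j] = coph[j][i] = height
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return coph


# Toy symmetric distance matrix for four objects (made-up values).
dist = [
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 3.0, 6.0],
    [4.0, 3.0, 0.0, 2.0],
    [5.0, 6.0, 2.0, 0.0],
]
coph = upgma_cophenetic(dist)
for row in coph:
    print(row)
```

With n objects the loop performs n - 1 fusions, so the resulting matrix contains at most n - 1 distinct values, one per fusion level.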
Studies of genetic divergence often involve a large number of individuals to be clustered, so obtaining the proposed cophenetic matrix by hand would be exhausting. With that in mind, a function in the R language was written to compute the cophenetic matrix for Tocher's method. It requires only the following inputs: the matrix of average distances within clusters (main diagonal) and between clusters (off-diagonal), the individuals ordered per cluster, and the number of individuals in each cluster. The function was used to obtain the cophenetic matrices for the two clusterings performed. Here is the proposed R function:
id.cluster -> vector (numeric) for identification of objects.
Details
To define id.cluster, the number 1 must be the lowest value and n (the number of objects) the highest. For example, the first 4 numbers (let us say 12, 28, 3 and 15) refer to the objects of the first cluster, the next 2 numbers (let us say 10 and 1) refer to the second cluster, and so on.
Value

It can be seen that, with the Mahalanobis distance, the minimum distance between two cultivars in the corresponding cophenetic matrix was 1.74, which equals the average distance within cluster 1. Thus, that is the distance between any two cultivars allocated in cluster 1. The greatest distance (11.78) corresponded to the average distance between clusters 3 and 5. In the cophenetic matrix based on clustering with the Euclidean distance, the shortest distance between two cultivars was 1.60 (cultivars 3 and 5), corresponding to the average distance within cluster 3. The greatest distance (6.14) corresponded to the average distance between clusters 2 and 6.
It is important to note that the matrix of original distances provided 136 distance measures. The construction of each proposed cophenetic matrix actually involved only 21 distance measures: 6 within and 15 between clusters. This number is higher than the number of fusion levels obtained with the hierarchical methods, which was 16 for both.
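These counts follow directly from the number of individuals, n = 17, and the number of clusters, g = 6 (as implied by the six within-cluster averages). The short worked arithmetic below is added for illustration:

```latex
\begin{align*}
\text{original distances:}\quad
  & \frac{n(n-1)}{2} = \frac{17 \cdot 16}{2} = 136 \\
\text{distances in the proposed cophenetic matrix:}\quad
  & \frac{g(g+1)}{2} = \frac{6 \cdot 7}{2} = 21
    = \underbrace{g}_{6\ \text{within}}
    + \underbrace{\frac{g(g-1)}{2}}_{15\ \text{between}} \\
\text{fusion levels in a hierarchical method:}\quad
  & n - 1 = 16
\end{align*}
```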
The cophenetic distances obtained with Tocher's method reliably synthesized the original distances (Figure 2), with an evidently higher linear association than the cophenetic distances obtained with the hierarchical methods, for both dissimilarity measures used. Ward's algorithm showed a weak linear association, which is an expected result because the method tends to show high values for the last fusion levels, and the correlation coefficient is sensitive to outliers. The cophenetic correlation coefficient was calculated for each of the ten thousand permutations of the cophenetic matrices obtained with each of the clustering methods and dissimilarity measures. Figure 3 shows the empirical distribution of these coefficients.
The correlations obtained with Tocher's method -0.90

Figure 2. Shepard diagram for association between original and cophenetic distances based on: A, the generalized squared Mahalanobis distance; and B, the Euclidean distance.

Figure 3. Kernel density of cophenetic correlation based on ten thousand permutations, obtained with: A, the generalized squared Mahalanobis distance; and B, the Euclidean distance.

Table 1. Dissimilarity matrix between 17 garlic cultivars, based on the generalized squared Mahalanobis distance (upper triangular matrix) and the Euclidean distance (lower triangular matrix).

Table 2. Matrix of cophenetic distances between 17 garlic cultivars, obtained by Tocher's method based on the generalized squared Mahalanobis distance (upper triangular matrix) and the Euclidean distance (lower triangular matrix).