Combinations of distance measures and clustering algorithms in pepper germplasm characterization

Hortic. bras., Brasília, v.37, n.2, April-June 2019

Capsicum is a highly diversified genus that includes sweet and chili peppers and is widely cultivated in both tropical and subtropical regions. Its species are vegetables of great economic importance, mainly because of their versatility in cuisine, industry, pharmacy and ornamental use. Besides being segmented and diverse, the genus Capsicum has a great variety of products and by-products, uses and forms of consumption (Sudré et al., 2010; Cardoso et al., 2018). According to FAOSTAT (2016), the production of fresh and dehydrated sweet and chili peppers was estimated at over 38 million tons, in a total cultivated area of 3.7 million ha. So far, 38 species of Capsicum have been described (USA-ARS, 2015), of which only five are cultivated for commercial purposes: C. annuum, C. frutescens, C. chinense, C. pubescens and C. baccatum.
With increasing extinction risks and loss of genetic variability, centers for plant genetic resource conservation (CPGRC) have been established worldwide. Plant genetic resources can be conserved as seed and pollen collections, in the field and in vitro, constituting what is called a germplasm bank (Engels & Visser, 2003). The resources conserved in germplasm banks include newly bred and obsolete cultivars, local varieties, breeding lines obtained as intermediate products, genetic stocks such as gene, chromosomal and genomic mutants, and wild relatives (Ríos, 2015).
Many useful traits, such as nutritional quality and resistance and/or tolerance to biotic and abiotic stresses, are found among the accessions conserved in germplasm banks. However, these accessions must be characterized and evaluated before they can be used to contribute to agricultural productivity (Dulloo et al., 2013). Germplasm can be characterized and evaluated through agronomic, morphological, cytological, biochemical and molecular information, which frequently involves numeric and categorical measurements and, in many cases, combinations of different variable types (Gonçalves et al., 2008; Sudré et al., 2010).
Several characterization studies of Capsicum spp. have been carried out (Signorini et al., 2013; Araújo et al., 2018; Cardoso et al., 2018; Moreira et al., 2018). Nevertheless, the generation of a large amount of data from different categories may make the results difficult to analyze and interpret, frequently resulting in an incomplete distinction of the accessions (Oliveira et al., 2016). Thus, a joint analysis of variables may provide a more complete indicator of the variability in germplasm banks. Few studies have used this strategy, mainly due to a lack of knowledge about which statistical techniques allow this approach, in addition to the tendency of researchers to give more importance to those variables which are directly related to the traits to be improved in a breeding program (Gonçalves et al., 2008; Moura et al., 2010).
Gower (1971) proposed a joint similarity measure for quantitative and qualitative variables, which has been widely adopted in studies on the characterization and evaluation of germplasm of different species (Gonçalves et al., 2008; Moura et al., 2010; Brandão et al., 2013; Kyriakopoulou et al., 2014; Abid et al., 2015; Oliveira et al., 2016). Another way to study the variables together is to combine specific measures for quantitative and qualitative variables using a pre-determined weight. Sarkar et al. (2015) proposed a mix of six measures of combined distance, considering three for quantitative data (a1: average of the range-standardized absolute difference, a2: Pearson correlation and a3: scaling based on standard score) and two for qualitative data (b1: standardized simple coincidence and b2: distance based on the average absolute difference). The authors verified that the combined distance a1b2 with the k-means clustering method gave the best allocation of the evaluated rice accessions.
The clustering methods usually applied to plant genetic resources are the agglomerative hierarchical methods UPGMA and Ward, and the non-hierarchical k-means analysis (Mohammadi & Prasanna, 2003; Crossa & Franco, 2004). Agglomerative hierarchical clustering starts by treating each individual as its own cluster. At each step of the algorithm, clusters are merged, forming new clusters until all the individuals are in a single group. The k-means method partitions n individuals into k groups, in which each individual belongs to the group with the nearest mean (Mingoti, 2005).
One of the main advantages of the k-means method in relation to the hierarchical methods is that an individual can change cluster as the algorithm proceeds. However, the disadvantage is that the number of clusters has to be chosen a priori, which may lead to misinterpretations of the data structure if the number of clusters is not optimal.
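The agglomerative procedure described above (each individual starts as its own cluster, and clusters are merged until a single group remains) can be sketched in a few lines. This is a minimal pure-Python illustration of average linkage (UPGMA), assuming a precomputed symmetric distance matrix; the function name `upgma_merges` and the toy matrix are hypothetical and are not taken from the study.

```python
from itertools import combinations


def average_distance(dist, a, b):
    """Mean pairwise distance between the members of clusters a and b."""
    return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))


def upgma_merges(dist):
    """Agglomerative clustering with average linkage (UPGMA).

    dist: symmetric matrix (list of lists) of pairwise distances.
    Returns the merges as (cluster_a, cluster_b, height), where each
    cluster is a tuple of original row indices.
    """
    clusters = [(i,) for i in range(len(dist))]  # every individual starts alone
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest average distance.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_distance(dist, *pair))
        merges.append((a, b, average_distance(dist, a, b)))
        # Replace the two clusters by their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Toy distance matrix for four hypothetical accessions.
toy = [[0, 1, 8, 10],
       [1, 0, 9, 11],
       [8, 9, 0, 2],
       [10, 11, 2, 0]]
for a, b, height in upgma_merges(toy):
    print(a, b, height)
```

The sketch recomputes averages from the original matrix at every step, which is clear but slow; library implementations update distances incrementally instead.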
In agglomerative hierarchical clustering, the best method is often defined by the cophenetic correlation coefficient (CCC), based on Pearson's correlation. However, CCC may not always be a reliable measure of the distortions generated by the algorithms (Mérigot et al., 2010; Carteron et al., 2012). Thus, Mérigot et al. (2010) proposed a methodology based on a matrix norm between the dissimilarity matrix (D) and the clustering matrix (U). A norm makes it possible to define a distance between D and U that satisfies the general properties of non-negativity, symmetry and definiteness.
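For reference, the two goodness-of-fit measures can be written as follows. This is a sketch in standard notation and makes two assumptions: d_ij and u_ij denote the elements of D and U for each pair i < j, and the norm is the spectral (2-) norm, as the name of the 2-norm analysis used later suggests.

```latex
\begin{align*}
\mathrm{CCC} &= \frac{\sum_{i<j}\,(d_{ij}-\bar{d})\,(u_{ij}-\bar{u})}
                     {\sqrt{\sum_{i<j}(d_{ij}-\bar{d})^{2}\;\sum_{i<j}(u_{ij}-\bar{u})^{2}}} \\[4pt]
\lVert D-U \rVert_{2} &= \sqrt{\lambda_{\max}\!\left[(D-U)^{\top}(D-U)\right]}
\end{align*}
```

A CCC closer to 1 and a norm closer to 0 both indicate that the dendrogram distorts the original dissimilarities less.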
This study aims to evaluate different clustering techniques for characterizing and evaluating Capsicum spp. accessions using combinations of specific measures for quantitative and qualitative variables. The joint analysis of these variables can be considered a strategy for accurately evaluating and understanding the variability of species in germplasm banks.
The accessions were characterized and evaluated based on the morphological and agronomic descriptors proposed by Bioversity International (http://www.bioversityinternational.org) for Capsicum spp. For morphoagronomic characterization, the experiment was carried out in the municipality of Campos dos Goytacazes, Rio de Janeiro (21°45'S, 41°18'W).
For the combined analysis of distances (quantitative and qualitative), six distance measures for quantitative data were considered:
i) Distance based on the average of the range-standardized absolute difference (Gower), where x_ik and x_jk are the values of the kth quantitative variable for the ith and jth accessions; r_k is the range of the kth variable; and p is the total number of quantitative variables (Gower, 1971).
ii) Distance based on Pearson correlation, where r_ij is the product-moment correlation (similarity) between the ith and jth accessions, so that dissimilarity = 1 - similarity.
iii) Kulczynski distance, where x_ij and x_ik are the values for the ith and jth accessions.
iv) Canberra distance, where x_ik and x_jk are the values of the kth quantitative variable for the ith and jth accessions; and p is the total number of quantitative variables.
v) Bray-Curtis distance, where x_ik and x_jk are the values of the kth quantitative variable for the ith and jth accessions; and p is the total number of quantitative variables.
vi) Morisita distance, where x_ik and x_jk are the values of the kth quantitative variable for the ith and jth accessions; and p is the total number of quantitative variables.
For qualitative data, the distance based on simple coincidence was used, where d_k = 0 if y_ik = y_jk, else d_k = 1 (Gower, 1971).
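The measures listed above can be written out as follows. These are the standard textbook forms, using the symbols defined in the list, and are a sketch rather than a statement of the exact scaling used in the study. Three things are assumptions: the 1/p factor in the Canberra distance (chosen so that the measure stays between 0 and 1 for non-negative data), the symbols λ_i and λ_j in the Morisita distance, and q, the number of qualitative variables.

```latex
\begin{align*}
\text{i) Gower:}\quad
  d_{ij} &= \frac{1}{p}\sum_{k=1}^{p}\frac{\lvert x_{ik}-x_{jk}\rvert}{r_{k}} \\
\text{ii) Pearson:}\quad
  d_{ij} &= 1-r_{ij} \\
\text{iii) Kulczynski:}\quad
  d_{ij} &= 1-\frac{1}{2}\left(
      \frac{\sum_{k}\min(x_{ik},x_{jk})}{\sum_{k}x_{ik}}
     +\frac{\sum_{k}\min(x_{ik},x_{jk})}{\sum_{k}x_{jk}}\right) \\
\text{iv) Canberra:}\quad
  d_{ij} &= \frac{1}{p}\sum_{k=1}^{p}\frac{\lvert x_{ik}-x_{jk}\rvert}{x_{ik}+x_{jk}} \\
\text{v) Bray-Curtis:}\quad
  d_{ij} &= \frac{\sum_{k}\lvert x_{ik}-x_{jk}\rvert}{\sum_{k}(x_{ik}+x_{jk})} \\
\text{vi) Morisita:}\quad
  d_{ij} &= 1-\frac{2\sum_{k}x_{ik}x_{jk}}
      {(\lambda_{i}+\lambda_{j})\sum_{k}x_{ik}\sum_{k}x_{jk}},
  \qquad
  \lambda_{i}=\frac{\sum_{k}x_{ik}(x_{ik}-1)}{\sum_{k}x_{ik}\left(\sum_{k}x_{ik}-1\right)} \\
\text{Simple coincidence:}\quad
  d_{ij} &= \frac{1}{q}\sum_{k=1}^{q}d_{k},
  \qquad d_{k}=\begin{cases}0 & y_{ik}=y_{jk}\\ 1 & \text{otherwise}\end{cases}
\end{align*}
```

Written this way, the sum sits inside the fraction for Bray-Curtis and outside it for Canberra, while Gower differs from both in dividing by the range of the variable.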
The elements of the six quantitative distance matrices (A1-A6) and of the qualitative distance matrix (B1) range between 0 and 1. Thus, each combined distance matrix was calculated as the sum of the corresponding quantitative and qualitative distances, that is, AkB1 = (a_kij + b_1ij) for k = 1, ..., 6, where (a_1ij), (a_2ij), (a_3ij), (a_4ij), (a_5ij), (a_6ij) and (b_1ij) represent the ijth elements of matrices A1, A2, A3, A4, A5, A6 and B1, respectively. These combined matrices were correlated using the Mantel test (1000 permutations).
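The combination step and the Mantel test can be illustrated with a short pure-Python sketch. It assumes the simple one-sided permutation form of the test, in which the rows and columns of one matrix are shuffled together; the function names (`combine`, `mantel`) and the toy matrices are hypothetical, and the software actually used in the study is not stated here.

```python
import math
import random


def upper(m):
    """Flatten the upper triangle (i < j) of a square matrix."""
    n = len(m)
    return [m[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def combine(a, b):
    """Element-wise sum of a quantitative and a qualitative distance matrix."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def mantel(m1, m2, permutations=1000, seed=0):
    """Return (r, p): matrix correlation and one-sided permutation p-value."""
    rng = random.Random(seed)
    n = len(m1)
    observed = pearson(upper(m1), upper(m2))
    hits = 0
    for _ in range(permutations):
        order = list(range(n))
        rng.shuffle(order)
        # Permute rows and columns of m2 with the same ordering.
        shuffled = [[m2[order[i]][order[j]] for j in range(n)] for i in range(n)]
        if pearson(upper(m1), upper(shuffled)) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Toy example: two quantitative matrices sharing one qualitative matrix.
a1 = [[0, 0.1, 0.6, 0.7], [0.1, 0, 0.5, 0.9], [0.6, 0.5, 0, 0.2], [0.7, 0.9, 0.2, 0]]
a2 = [[0, 0.2, 0.5, 0.8], [0.2, 0, 0.4, 0.7], [0.5, 0.4, 0, 0.3], [0.8, 0.7, 0.3, 0]]
b1 = [[0, 0.0, 1.0, 1.0], [0.0, 0, 0.5, 1.0], [1.0, 0.5, 0, 0.5], [1.0, 1.0, 0.5, 0]]
r, p = mantel(combine(a1, b1), combine(a2, b1))
print(r, p)
```

Because every combination shares the same qualitative matrix B1, part of the correlation between any two combined matrices is built in by construction.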
Capsicum spp. accessions were clustered using different agglomerative hierarchical methods (Ward, nearest neighbor, farthest neighbor, UPGMA and WPGMA). Afterwards, the fit between each combined distance matrix and the corresponding clustering matrix was assessed with the cophenetic correlation coefficient (based on Pearson correlation) and with 2-norm analysis (Mérigot et al., 2010).
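A comparison of this kind can be sketched with SciPy. This is an illustrative example, not the code used in the study: it runs on random toy data, `fit_quality` is a hypothetical helper, and the 2-norm is assumed to be the spectral norm of the difference between the dissimilarity matrix and the cophenetic matrix.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist, squareform

# Names used in the text mapped to SciPy linkage methods.
METHODS = {
    "Ward": "ward",
    "nearest neighbor": "single",
    "farthest neighbor": "complete",
    "UPGMA": "average",
    "WPGMA": "weighted",
}


def fit_quality(condensed):
    """Return {method: (CCC, 2-norm)} for a condensed distance vector."""
    full = squareform(condensed)                 # dissimilarity matrix D
    results = {}
    for label, method in METHODS.items():
        tree = linkage(condensed, method=method)
        ccc, coph = cophenet(tree, condensed)    # coph: cophenetic distances (U)
        two_norm = np.linalg.norm(full - squareform(coph), 2)
        results[label] = (float(ccc), float(two_norm))
    return results


rng = np.random.default_rng(0)
toy = rng.random((12, 4))                        # 12 toy accessions, 4 variables
quality = fit_quality(pdist(toy))
for label, (ccc, two_norm) in quality.items():
    print(f"{label:18s} CCC={ccc:.2f}  2-norm={two_norm:.2f}")
```

One caveat: SciPy's Ward linkage assumes Euclidean input distances, and Ward merge heights are not on the same scale as the input dissimilarities, which can inflate the norm for that method.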

RESULTS AND DISCUSSION
The combined distance matrices were highly associated: all correlations were significant at 1% probability by the Mantel test (Table 1). The highest correlation values (0.98) were observed between A1B1 x A4B1, A2B1 x A6B1, and A3B1 x A5B1, whereas the lowest value (0.77) was observed between A1B1 x A2B1.
The high correlation between the combined distances is due to different factors, such as the similarity between some of the distances for quantitative data, namely Canberra, Bray-Curtis and Gower.
The difference between Bray-Curtis and Canberra lies in the sum over variables: in Bray-Curtis the sum is inside the fraction, whereas in Canberra it is outside the fraction. In relation to Gower distance, the difference is the denominator, which is determined by the range of the accessions studied for a given variable k, whereas for Bray-Curtis and Canberra the denominator is the sum of the values of accessions i and j for variable k. Pearson, Morisita and Kulczynski distances are more dissimilar, both among themselves and in comparison to Canberra, Bray-Curtis and Gower. Only the Pearson/simple coincidence combined distance (A2B1) showed correlations below 0.9 with the other combined distances (A1B1 x A2B1 = 0.77; A2B1 x A3B1 = 0.88; A2B1 x A4B1 = 0.84 and A2B1 x A5B1 = 0.85) (Table 1).
In most studies of plant germplasm characterization that use joint analyses of quantitative and qualitative data, Gower distance (A1B1) is used (Gonçalves et al., 2008; Adewale et al., 2012; Sartie et al., 2012; Silva et al., 2015). However, other combinations can be used to define the dissimilarity/similarity among accessions more reliably.
Evaluating the cophenetic correlation coefficient (CCC) between the agglomerative hierarchical clusterings and the combined distance matrices, UPGMA clustering obtained the highest values, ranging from 0.77 (A6B1) to 0.84 (A4B1) (Table 2). The lowest values were verified for Ward clustering, which ranged from 0.60 (A2B1) to 0.76 (A1B1). According to Sokal & Rohlf (1962), CCC≥0.9 shows a very good fit, 0.8≤CCC<0.9 a good fit, 0.7≤CCC<0.8 a poor fit and CCC<0.7 a very poor fit. Using this classification, we observed that the majority of the values obtained by the UPGMA method showed a good fit between the clustering and dissimilarity matrices. Mérigot et al. (2010) have raised three criticisms about the reliability of the information obtained from CCC analysis: i) it is a measure of the intensity of the monotonic relationship between the dissimilarity (D) and clustering (U) matrices; ii) it is sensitive to extreme values; and iii) a CCC close to 1 suggests a perfect correspondence between D and U, whereas the correspondence between the two matrices may in fact be weak. However, when 2-norm analysis was performed, lower values for UPGMA clustering and higher values for Ward clustering were observed, showing an agreement between the CCC and 2-norm methods (Table 2). Carteron et al. (2012), comparing 15 distance measures and seven agglomerative hierarchical clusterings, observed that 2-norm analysis and CCC did not agree on the efficiency of the clustering algorithms, since CCC did not provide any clear indication of their efficiency. Using 2-norm analysis, UPGMA was the most efficient algorithm, whereas Ward was the least efficient (Table 2).
Despite the large difference in the 2-norm values between UPGMA and Ward clusterings, only a little distortion between the two clusterings was observed for the Gower distance (A1B1) (Figure 1). In UPGMA clustering, based on the seven criteria (frey, pseudot2, dunn, mcclain, cindex, cc and silhouette), six was the optimal number of clusters: groups I and II were formed by C. chinense accessions, group III by C. baccatum accessions, group IV by C. frutescens accessions and groups V and VI by C. annuum accessions. In Ward clustering, the separation of species into groups was also verified, except for group I, which was formed by four C. annuum accessions and six C. frutescens accessions.
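As an illustration of how one of the listed criteria can point to a number of clusters, the sketch below cuts a UPGMA tree at several values of k and keeps the k with the largest mean silhouette width. It covers only the silhouette criterion, on toy coordinates; `mean_silhouette` and `best_k` are hypothetical helpers and this is not the procedure or software used in the study.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform


def mean_silhouette(full, labels):
    """Average silhouette width from a square distance matrix and labels."""
    widths = []
    for i, own in enumerate(labels):
        mates = labels == own
        mates[i] = False                       # exclude the point itself
        if not mates.any():                    # singleton cluster: width 0
            widths.append(0.0)
            continue
        a = full[i, mates].mean()              # mean distance within own cluster
        b = min(full[i, labels == other].mean()
                for other in np.unique(labels) if other != own)
        widths.append((b - a) / max(a, b))
    return float(np.mean(widths))


def best_k(condensed, candidates=range(2, 7), method="average"):
    """Cut the tree at each k and return (best k, {k: mean silhouette})."""
    tree = linkage(condensed, method=method)
    full = squareform(condensed)
    scores = {}
    for k in candidates:
        labels = fcluster(tree, t=k, criterion="maxclust")
        if len(np.unique(labels)) > 1:
            scores[k] = mean_silhouette(full, labels)
    return max(scores, key=scores.get), scores


# Three well-separated toy groups of four points each.
toy = np.array([[0.0, 0.0], [0.3, 0.1], [0.1, 0.4], [0.5, 0.5],
                [10.0, 0.0], [10.2, 0.3], [9.8, 0.5], [10.4, 0.1],
                [0.0, 10.0], [0.4, 10.3], [0.2, 9.7], [0.6, 10.1]])
k, scores = best_k(pdist(toy))
print(k, scores)
```

With several criteria, as in the text, the same loop would be repeated per index and the most frequently selected k retained.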
Comparing the different combinations of distances under UPGMA clustering, we observed that Gower distance gave the best fit for group formation, that is, for the separation of species, when compared with the others (Figures 2 and 3). We also observed a better fit of the groups formed by Gower distance when the groups were validated with the fpc package (Flexible Procedures for Clustering, R program). The species of the genus Capsicum are distributed in three distinct gene complexes based on crossability. The annuum complex consists of C. annuum, C. chinense and C. frutescens. These species are integrated by morphological characteristics, derived from wild relatives of different species; they are potentially easily crossed (Onus & Pickersgill, 2004) and they have the capacity to produce interspecific hybrids (Hill et al., 2013). The baccatum complex consists of C. baccatum var. baccatum, C. baccatum var. praetermissum and C. baccatum var. pendulum, and the pubescens complex consists of some wild species and only one cultivated species, C. pubescens. Thus, most of the combined distances using the UPGMA method allowed the separation of Capsicum species, with greater efficiency in maximizing the dissimilarities between the annuum and baccatum complexes. However, we did not observe any correct separation of complexes of the genus when joint analyses of quantitative and qualitative variables of the descriptors proposed by Bioversity International for Capsicum spp. were used.
Data obtained in this study show a viable alternative in determining genetic variability and divergence among the evaluated accessions, with the generation of more accurate and more complete information.