## Genetics and Molecular Biology

*Print version* ISSN 1415-4757

### Genet. Mol. Biol. vol.22 n.3 São Paulo Sept. 1999

#### http://dx.doi.org/10.1590/S1415-47571999000300024

**Comparison of similarity coefficients based on RAPD markers in the common bean***

**Jair Moura Duarte, João Bosco dos Santos and Leonardo Cunha Melo**

*Part of a thesis presented by J.M.D. to the Universidade Federal de Lavras, Lavras, MG, in partial fulfillment of the requirements for the Master's degree. Departamento de Biologia, Universidade Federal de Lavras, Caixa Postal 37, 37200-000 Lavras, MG, Brasil. Send correspondence to J.B.S. E-mail: jbsantos@ufla.br

ABSTRACT

The alterations caused by eight different similarity coefficients were evaluated in the clustering and ordination of 27 common bean (*Phaseolus vulgaris* L.) cultivars analyzed by RAPD markers. The Anderberg, simple matching, Rogers and Tanimoto, Russel and Rao, Ochiai, Jaccard, Sorensen-Dice, and Ochiai II's coefficients were tested. Comparisons among the coefficients were made through correlation analysis of genetic distances obtained by the complement of these coefficients, dendrogram evaluation (visual inspection and consensus fork index - CI_{C}), projection efficiency in a two-dimensional space, and groups formed by Tocher's optimization procedure. The employment of different similarity coefficients caused few alterations in cultivar classification, since correlations among genetic distances were larger than 0.86. Nevertheless, the different similarity coefficients altered the projection efficiency in a two-dimensional space and formed different numbers of groups by Tocher's optimization procedure. Among these coefficients, Russel and Rao's was the most discordant and the Sorensen-Dice was considered the most adequate due to a higher projection efficiency in a two-dimensional space. Even though few structural changes were suggested in the most different groups, these coefficients altered some relationships between cultivars with high genetic similarity.

INTRODUCTION

Studies of divergence and phylogenetic relationships between and within plant species of agricultural interest have been one of the most concrete contributions of molecular markers to germplasm organization, plant genetics and breeding. Multivariate techniques such as clustering and ordination analyses are frequently employed in these studies to give a simplified representation of the results. The first step in these analyses is the construction of a similarity (or distance) matrix between the cultivars being evaluated.

Jackson *et al.* (1989) commented that employment of these techniques has revealed some problems. The objective nature of the analyses is compromised by the subjective choice of the clustering method and/or the similarity-dissimilarity coefficient.

Several coefficients have been proposed (Sokal and Sneath, 1963; Sneath and Sokal, 1973; Johnson and Wichern, 1988). Similarity coefficients specific for dichotomous variables, especially co-occurrence measures, are suggested for use with RAPD-type molecular markers. These coefficients are built from different ratios of agreements or disagreements to the total number of comparisons, and their values vary from 0 to 1 (Skroch *et al.*, 1992). Though many coefficients are available, published studies usually do not justify their preference for any one in particular. Considering that clustering and ordination results can be influenced by this choice (Gower and Legendre, 1986; Jackson *et al*., 1989), these coefficients need to be better understood, so that the most efficient ones can be employed.

In this study, the alterations caused by eight different similarity coefficients in the subsequent clustering and ordination analyses of 27 common bean (*Phaseolus vulgaris* L.) cultivars analyzed by RAPD markers were evaluated, and the most adequate coefficient for the study of genetic divergence in these cultivars was identified.

MATERIAL AND METHODS

Similarity coefficients were compared among 27 common bean cultivars (Table I) analyzed by RAPD markers. Procedures for DNA extraction, RAPD reaction and electrophoresis were essentially as described by Nienhuis *et al*. (1995).

From a zero-and-one matrix constructed from 137 medium/strong RAPD bands, where zero represented the absence of a band and one its presence, genetic similarity estimates (*sg_{ij}*) between each pair of cultivars *i* and *j* were computed for eight similarity coefficients (Table II). Similarities derived from these coefficients were transformed into genetic distance measures by the following equation: *dg_{ij}* = 1 - *sg_{ij}*. All the genetic similarity matrices met the presuppositions for transformation into genetic distances described by Johnson and Wichern (1988), that is, all of them were non-negative definite. Similarity analyses were done with the NTSYS-PC program (Rohlf, 1992).
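The computation described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the NTSYS-PC code: the function names are hypothetical, the eight formulas are the usual textbook forms of these coefficients (assumed, not confirmed, to match Table II), and treating a zero denominator as similarity 0 is a convention chosen here. The counts are *a* (band present in both cultivars), *b* and *c* (present in only one) and *d* (absent in both).

```python
from math import sqrt


def cooccurrence_counts(x, y):
    """Count a (1,1), b (1,0), c (0,1) and d (0,0) over two 0/1 band profiles."""
    a = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)
    b = sum(1 for p, q in zip(x, y) if p == 1 and q == 0)
    c = sum(1 for p, q in zip(x, y) if p == 0 and q == 1)
    d = sum(1 for p, q in zip(x, y) if p == 0 and q == 0)
    return a, b, c, d


def _ratio(num, den):
    # Assumed convention: an undefined ratio (zero denominator) counts as 0.
    return num / den if den else 0.0


# Textbook forms of the eight coefficients (assumed to correspond to Table II).
COEFFICIENTS = {
    "simple matching": lambda a, b, c, d: _ratio(a + d, a + b + c + d),
    "Rogers and Tanimoto": lambda a, b, c, d: _ratio(a + d, a + d + 2 * (b + c)),
    "Russel and Rao": lambda a, b, c, d: _ratio(a, a + b + c + d),
    "Jaccard": lambda a, b, c, d: _ratio(a, a + b + c),
    "Sorensen-Dice": lambda a, b, c, d: _ratio(2 * a, 2 * a + b + c),
    "Anderberg": lambda a, b, c, d: _ratio(a, a + 2 * (b + c)),
    "Ochiai": lambda a, b, c, d: _ratio(a, sqrt((a + b) * (a + c))),
    "Ochiai II": lambda a, b, c, d: _ratio(
        a * d, sqrt((a + b) * (a + c) * (d + b) * (d + c))
    ),
}


def distance_matrix(profiles, coefficient):
    """Return dg_ij = 1 - sg_ij for every pair of 0/1 profiles.

    The diagonal is set to 0 by convention, since some coefficients
    (e.g. Russel and Rao) do not give a self-similarity of 1.
    """
    sim = COEFFICIENTS[coefficient]
    n = len(profiles)
    return [
        [
            0.0 if i == j
            else 1.0 - sim(*cooccurrence_counts(profiles[i], profiles[j]))
            for j in range(n)
        ]
        for i in range(n)
    ]
```

For example, `distance_matrix(profiles, "Jaccard")` on a list of band profiles returns the symmetric matrix of Jaccard-based genetic distances.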

Coefficients were compared by Spearman's correlation between the genetic distances generated by the complement of these coefficients, and also by the evaluation of alterations caused by these different coefficients in the subsequent clustering analyses (construction of dendrograms and groups formed by Tocher's optimization procedure, cited by Rao, 1952) and ordination analyses (two-dimensional projection (Cruz and Viana, 1994)).
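As an illustration of the first comparison, the sketch below computes Spearman's rank correlation between the off-diagonal entries of two distance matrices. It is a plain-Python sketch under stated assumptions, not the procedure of any particular program: the function names are hypothetical, each pair of cultivars is counted once (upper triangle), and ties receive average ranks.

```python
from math import sqrt


def upper_triangle(matrix):
    """Flatten the entries above the diagonal, one per pair of cultivars."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def average_ranks(values):
    """Rank values from 1 upward, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            ranks[k] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither list is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sxx = sum((p - mx) ** 2 for p in x)
    syy = sum((q - my) ** 2 for q in y)
    return sxy / sqrt(sxx * syy)


def spearman_between_matrices(d1, d2):
    """Spearman correlation = Pearson correlation of the ranked distances."""
    return pearson(
        average_ranks(upper_triangle(d1)),
        average_ranks(upper_triangle(d2)),
    )
```

Because only ranks are used, two coefficients that order all pairs of cultivars identically give a correlation of exactly 1 even when their numerical values differ.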

The unweighted pair-group method with arithmetic mean (UPGMA) was employed to construct the dendrograms. Each cultivar was treated as an operational taxonomic unit (OTU). The different dendrograms were subjectively compared by visual inspection, and then contrasted through consensus trees using the consensus fork index (CI_{C}), obtained from comparisons of all pairwise combinations of dendrograms (Rohlf, 1982).
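For readers unfamiliar with UPGMA, a minimal sketch follows. It is not the NTSYS-PC implementation: the function name is hypothetical, the search for the closest pair is deliberately naive, and ties are resolved by taking the first pair found. At each step the two clusters with the smallest average pairwise distance between their members are merged.

```python
from itertools import combinations


def upgma(dist, labels):
    """Cluster OTUs by UPGMA.

    dist   : symmetric matrix (list of lists) of pairwise distances
    labels : OTU names, in the same order as the rows of dist
    Returns the merges in order, as (set of OTU names, merge height).
    """
    clusters = [frozenset([i]) for i in range(len(labels))]
    merges = []

    def average_distance(c1, c2):
        # UPGMA distance: mean of all distances between members of c1 and c2.
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Closest pair of current clusters (first one found in case of ties).
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: average_distance(*pair),
        )
        height = average_distance(c1, c2)
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        merges.append((frozenset(labels[i] for i in c1 | c2), height))
    return merges
```

The list of merged sets is one convenient way to represent a dendrogram when two trees later need to be compared cluster by cluster.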

The CI_{C} index gives a relative estimate of dendrogram similarity. It is obtained by dividing the number of forks common to the dendrograms by the maximum possible number of forks, which is *n* - 2 for fully resolved dendrograms (*n* is the number of OTUs) (Rohlf, 1982). Dendrograms were obtained with the 'SAHN-*Clustering*' option and the CI_{C} index with the 'CONSENSUS-*Consensus tree*' option, both in the NTSYS-PC program (Rohlf, 1992).
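The index can be illustrated with a short sketch. It assumes a strict consensus, in which a fork is kept only if the same set of OTUs forms a cluster in both dendrograms, and it represents each dendrogram simply as a collection of clusters; the function name and this representation are hypothetical choices, not those of NTSYS-PC.

```python
def consensus_fork_index(clusters_a, clusters_b, n_otus):
    """Consensus fork index between two dendrograms.

    clusters_a, clusters_b : iterables of clusters, each a set of OTU names
    n_otus                 : total number of OTUs (n)
    Singletons and the cluster containing every OTU are ignored, so a
    fully resolved dendrogram contributes exactly n - 2 forks.
    """
    def nontrivial(clusters):
        return {frozenset(c) for c in clusters if 1 < len(c) < n_otus}

    shared = nontrivial(clusters_a) & nontrivial(clusters_b)
    return len(shared) / (n_otus - 2)
```

With five OTUs, the trees (((A,B),E),(C,D)) and (((A,B),(C,D)),E) share the forks {A,B} and {C,D} out of a possible three, giving an index of 2/3; two identical fully resolved trees give 1.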

The methodology of Cruz and Viana (1994), as implemented in the GENES program (Cruz, 1997), was employed for the projection of distances in a two-dimensional space. Similarity coefficients were compared by the efficiency of the projection, considering:

a) Correlation between the original distances and the distances obtained by the graphic representation of two-dimensional dispersion;

b) Distortion degree (1 - α), where α is computed from *d_{gij}* and *d_{oij}*, the graph distances (two-dimensional space) and original distances (n-dimensional space), respectively, of every pair of cultivars *i* and *j* (Cruz and Viana, 1994);

c) Stress (*s*) value. This statistical representation of stress (standardized residual sum of squares) was proposed by Kruskal (1964). It is a parameter that determines the goodness-of-fit of the graphic projection. Stress was classified according to the suggestions of Kruskal (1964).
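Two of these efficiency measures can be sketched as follows. The exact expressions used by the GENES program are not reproduced in this text, so the sketch rests on assumptions: the correlation is taken as Pearson's, and stress is written as the square root of the residual sum of squares between original and graph distances, standardized by the sum of squared original distances. The function names are hypothetical, and the distortion degree is omitted because its defining expression is not available here.

```python
from math import hypot, sqrt


def graph_distances(coords):
    """Euclidean distances between all pairs of points in the 2-D projection."""
    n = len(coords)
    return [
        hypot(coords[i][0] - coords[j][0], coords[i][1] - coords[j][1])
        for i in range(n)
        for j in range(i + 1, n)
    ]


def projection_efficiency(original, coords):
    """Return (correlation, stress) for a two-dimensional projection.

    original : symmetric matrix of original distances (d_oij)
    coords   : list of (x, y) positions of the cultivars in the plane
    """
    n = len(original)
    d_o = [original[i][j] for i in range(n) for j in range(i + 1, n)]
    d_g = graph_distances(coords)

    # Pearson correlation between original and graph distances.
    mean_o, mean_g = sum(d_o) / len(d_o), sum(d_g) / len(d_g)
    s_og = sum((o - mean_o) * (g - mean_g) for o, g in zip(d_o, d_g))
    s_oo = sum((o - mean_o) ** 2 for o in d_o)
    s_gg = sum((g - mean_g) ** 2 for g in d_g)
    correlation = s_og / sqrt(s_oo * s_gg)

    # Assumed stress form: standardized residual sum of squares.
    residual = sum((o - g) ** 2 for o, g in zip(d_o, d_g))
    stress = sqrt(residual / sum(o ** 2 for o in d_o))
    return correlation, stress
```

A projection that reproduces every original distance exactly gives a correlation of 1 and a stress of 0; the larger the stress, the poorer the fit of the two-dimensional picture.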

Groups were established by Tocher's optimization procedure using the GENES program (Cruz, 1997). The inter-group distance limit was the largest value in the set of smallest distances involving each cultivar, that is, the largest of the nearest-neighbor distances.
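A simplified sketch of this grouping procedure is given below. Only the definition of the limit comes from the description above; the rest is an assumption about how the procedure is commonly described, not a transcription of the GENES program: each group is seeded with the closest remaining pair, and a cultivar joins the group while its mean distance to the current members does not exceed the limit. Function names are hypothetical.

```python
from itertools import combinations


def tocher_limit(dist):
    """Largest of the smallest distances involving each cultivar."""
    n = len(dist)
    return max(min(dist[i][j] for j in range(n) if j != i) for i in range(n))


def mean_distance_to_group(dist, candidate, group):
    return sum(dist[candidate][m] for m in group) / len(group)


def tocher_groups(dist):
    """Partition cultivars (row indices of dist) into mutually exclusive groups."""
    limit = tocher_limit(dist)
    remaining = set(range(len(dist)))
    groups = []
    while remaining:
        if len(remaining) == 1:
            groups.append([remaining.pop()])
            break
        # Seed a new group with the closest pair among the remaining cultivars.
        i, j = min(
            combinations(sorted(remaining), 2),
            key=lambda pair: dist[pair[0]][pair[1]],
        )
        if dist[i][j] > limit:
            # No remaining pair is close enough: every cultivar stands alone.
            groups.extend([k] for k in sorted(remaining))
            break
        group = [i, j]
        remaining -= {i, j}
        while remaining:
            # Assumed inclusion rule: nearest candidate, by mean distance.
            candidate = min(
                sorted(remaining),
                key=lambda k: mean_distance_to_group(dist, k, group),
            )
            if mean_distance_to_group(dist, candidate, group) > limit:
                break
            group.append(candidate)
            remaining.remove(candidate)
        groups.append(sorted(group))
    return groups
```

Because the limit depends on the distance matrix itself, changing the similarity coefficient changes both the limit and the distances it is compared with, which is one way the number of groups can vary between coefficients.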

Levels of statistical significance are not given because the analyses are derived from a single initial data matrix and therefore lack independence.

RESULTS AND DISCUSSION

Correlations between the different genetic distances were all close to 1 (Table III), making it evident that they are highly related. Even though all these correlations were high, those involving Russel and Rao's coefficient were slightly lower than those among the other coefficients. Such high distance correlations seem to be a constant feature of the different coefficients applied to dichotomous variables. Johns *et al*. (1997), in a study with RAPD markers in the common bean, found correlations on the order of 0.989, 0.972 and 0.979 between the genetic distances obtained by the complement of the simple matching coefficient, Jaccard and Nei-Li's coefficients and Rogers' modified distance, respectively.

The dendrograms constructed from the coefficients studied all presented the same general structure (Figure 1), making it evident that the different coefficients caused few alterations. Considering that the 27 common bean cultivars belonged to two distinct domestication centers and different races, one can perceive that all the dendrograms were capable of dividing the cultivars into their respective domestication centers. However, some modifications in the clustering of races could be found. These results are in agreement with those obtained by Johns *et al*. (1997), who verified that different similarity coefficients basically did not influence the clustering of common bean landraces from Chile in groups corresponding to the Mesoamerican and Andean domestication centers.

Although all dendrograms were similar, contrasting them by the CI_{C} index (Table IV) revealed small differences among them. By this index, which ranges from 0 to 1, two dendrograms are considered identical when the calculated value equals one. Therefore, the dendrogram in Figure 1 obtained by Jaccard's similarity coefficient is identical to that of Sorensen-Dice, as were those of Rogers and Tanimoto and Ochiai II. Comparing dendrograms by this index, one can also perceive their division into two groups based on their similarity: the first comprised those constructed with the simple matching, Rogers and Tanimoto, Ochiai and Ochiai II coefficients; the second comprised those constructed with the Anderberg, Jaccard and Sorensen-Dice coefficients. The dendrogram constructed with Russel and Rao's coefficient presented very low CI_{C} values in comparison with the others, making it evident that this coefficient is the most discordant, as visual evaluation of this dendrogram (Figure 1) also shows. These results are highly coherent with those of Jackson *et al*. (1989), who, studying relationships between fish species with different similarity coefficients, found that cluster analysis gave strongly similar dendrograms for Jaccard and Sorensen-Dice's coefficients, and for the simple matching and Rogers and Tanimoto's coefficients.

The similar appearance of some dendrograms is not surprising, since generalizations about the properties of several coefficients are possible. The coefficients differ in the manner in which the matrix of original data (1 = presence of the RAPD marker and 0 = absence) is employed in the similarity estimate. When two genotypes are compared at a given band, four situations can occur: *a* = (1,1), band present in both; *b* = (1,0), present only in the first; *c* = (0,1), present only in the second; *d* = (0,0), absent in both. Thus, Jaccard and Sorensen-Dice's coefficients are equivalent, except that double weight is given to positive co-occurrences (*a*) in the Sorensen-Dice coefficient. Simple matching and Rogers and Tanimoto's coefficients include negative co-occurrences (*d*), but differ by the double weight given to the disagreements (that is, *b* and *c*) in the latter coefficient. As shown by the results presented, different weights for *a*, *b*, *c* and *d* seem to have limited impact on the subsequent analyses.
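The relationships within each family can be made explicit with a short derivation. It assumes the usual textbook forms of the coefficients (taken here to correspond to Table II), with the subscripts J, SD, A, SM and RT standing for Jaccard, Sorensen-Dice, Anderberg, simple matching and Rogers and Tanimoto; these symbols are introduced only for this illustration.

```latex
\begin{aligned}
S_{J} &= \frac{a}{a+b+c}, &
S_{SD} &= \frac{2a}{2a+b+c} = \frac{2\,S_{J}}{1+S_{J}}, &
S_{A} &= \frac{a}{a+2(b+c)} = \frac{S_{J}}{2-S_{J}}, \\[4pt]
S_{SM} &= \frac{a+d}{a+b+c+d}, &
S_{RT} &= \frac{a+d}{a+d+2(b+c)} = \frac{S_{SM}}{2-S_{SM}}.
\end{aligned}
```

Each right-hand expression is a strictly increasing function of the coefficient it is written in terms of, so the coefficients within a family rank all pairs of cultivars in the same order. This is consistent with the near-identical results observed within each family, although identical ranks do not by themselves guarantee identical UPGMA dendrograms, because UPGMA averages the distance values themselves.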

The different similarity coefficients altered the efficiency of distance projection in a two-dimensional space (Table V). Considering the three efficiency parameters separately (distortion, correlation between original and estimated distances, and stress), one can perceive the same general tendency in the classification of the coefficients. The distortion values are coherent with the correlation values, and both are coherent with the level of stress. Stress is the most widely used parameter to evaluate projection efficiency. Ochiai's coefficient showed the smallest stress value and Russel and Rao's the largest. According to the classification of Kruskal (1964), the simple matching, Sorensen-Dice and Ochiai coefficients had good levels of stress; the Rogers and Tanimoto, Anderberg, Jaccard and Ochiai II coefficients had regular levels; and only Russel and Rao's coefficient had a level of stress considered unsatisfactory.

One cultivar clustering method that has also been employed with RAPD data is Tocher's optimization procedure, cited by Rao (1952). In this method, individuals (cultivars) are partitioned into non-empty and mutually exclusive sub-groups by maximizing or minimizing a pre-established measurement (Cruz and Regazzi, 1994). The method requires a similarity or distance matrix, which can be obtained with several coefficients. Different coefficients altered the number of groups formed, which varied from six to 10 (Table VI). They also altered the classification of some cultivars in these groups. These results (Table VI) followed the same tendency as the preceding ones, with Russel and Rao's similarity coefficient once again the most discordant. Sokal and Sneath (1963) reported that this coefficient is, in essence, a 'hybrid' coefficient, excluding negative co-occurrences (*d*) from the numerator, but not from the denominator. This seems to be of questionable usefulness.

All results obtained illustrate the redundancy of the different coefficients. Anderberg, Jaccard and Sorensen-Dice's coefficients had approximately identical results, as did the simple matching and Rogers and Tanimoto's coefficients. Nevertheless, similarity coefficient choice should be based on some criteria, because even a few structural changes of more differentiated groups can alter the relationship between cultivars with high genetic similarity.

In relation to these criteria, an important aspect to be considered is the inclusion or exclusion of negative co-occurrences in the coefficient. This choice is closely related to the type of trait with which one is working. In some cases, an absence of the trait in both individuals indicates similarity, but in other cases this is not necessarily true. Taking into consideration the genetic basis of RAPD markers (Williams *et al*., 1990), the absence of amplification of a given band in two genotypes does not necessarily represent genetic similarity between them, which makes those coefficients that exclude negative co-occurrences from their expression of similarity (Jaccard, Sorensen-Dice, Ochiai, etc.) more adequate for use with this type of marker. Sokal and Sneath (1963) also stated that the simpler the coefficient, the easier its interpretation; therefore, simpler coefficients should preferentially be employed. Jaccard's similarity coefficient is the simplest of its category (exclusion of *d*), and it has been widely employed with RAPD markers. In this study, cultivar clustering results with Jaccard and Sorensen-Dice's coefficients were identical, but the latter gave a higher projection efficiency in a two-dimensional space (smaller distortion and stress, higher correlation). The Sorensen-Dice coefficient can therefore be considered the most adequate for a genetic divergence study in this group of cultivars employing RAPD markers.

ACKNOWLEDGMENTS

Research supported by CAPES and FAPEMIG.

RESUMO (SUMMARY)

The alterations caused by eight different similarity coefficients in the clustering of 27 common bean cultivars analyzed by RAPD markers were evaluated. The Anderberg, simple matching, Rogers and Tanimoto, Russel and Rao, Ochiai, Jaccard, Sorensen-Dice and Ochiai II coefficients were tested. They were compared through the correlations between the genetic distances obtained by the complement of these coefficients, and also through evaluation of the dendrograms (visual inspection and CI_{C} index), the projection efficiency in a two-dimensional space and the groups formed by Tocher's optimization method. The results showed that the use of different similarity coefficients caused few alterations in the classification of cultivars into groups, with correlations between the genetic distances larger than 0.86. Nevertheless, different coefficients altered the projection efficiency in a two-dimensional space and formed different numbers of groups by Tocher's optimization method. Among them, Russel and Rao's coefficient gave the most discordant results in relation to the others, and the Sorensen-Dice coefficient was considered the most adequate because of its higher projection efficiency in a two-dimensional space. Even though they caused few changes in the structure of the most differentiated groups, these coefficients altered some relationships between cultivars with high genetic similarity.

REFERENCES

**Anderberg, M.R.** (1973). *Cluster Analysis for Applications*. Academic Press, New York.

**Cruz, C.D.** (1997). Programa Genes: aplicativo computacional em genética e estatística. Universidade Federal de Viçosa, Viçosa.

**Cruz, C.D.** and **Regazzi, A.J.** (1994). *Modelos Biométricos Aplicados ao Melhoramento Genético*. UFV, Viçosa.

**Cruz, C.D.** and **Viana, J.M.S.** (1994). A methodology of genetic divergence analysis based on sample unit projection on two-dimensional space. *Rev. Bras. Genet. 17*: 69-73.

**Dice, L.R.** (1945). Measures of the amount of ecologic association between species. *Ecology 26*: 297-302.

**Gower, J.C.** and **Legendre, P.** (1986). Metric and Euclidean properties of dissimilarity coefficients. *J. Classif. 3*: 5-48.

**Jaccard, P.** (1901). Étude comparative de la distribution florale dans une portion des Alpes et des Jura. *Bull. Soc. Vaudoise Sci. Nat. 37*: 547-579.

**Jackson, D.A., Somers, K.M.** and **Harvey, H.H.** (1989). Similarity coefficients: measures for co-occurrence and association or simply measures of occurrence? *Am. Nat. 133*: 436-453.

**Johns, M.A., Skroch, P.W., Nienhuis, J., Hinrichsen, P., Bascur, G.** and **Muñoz-Schick, C.** (1997). Gene pool classification of common bean landraces from Chile based on RAPD and morphological data. *Crop Sci. 37*: 605-613.

**Johnson, R.A.** and **Wichern, D.W.** (1988). *Applied Multivariate Statistical Analysis*. Prentice-Hall, New Jersey.

**Kruskal, J.B.** (1964). Multidimensional scaling by optimizing goodness of fit to a non-metric hypothesis. *Psychometrika 29*: 1-27.

**Nienhuis, J., Tivang, J., Skroch, P.** and **Santos, J.B. dos** (1995). Genetic relationships among cultivars and lines of lima bean (*Phaseolus lunatus* L.) as measured by RAPD markers. *J. Am. Soc. Hort. Sci. 120*: 300-306.

**Ochiai, A.** (1957). Zoogeographic studies on the soleoid fishes found in Japan and its neighbouring regions. *Bull. Jpn. Soc. Sci. Fish. 22*: 526-530.

**Rao, C.R.** (1952). *Advanced Statistical Methods in Biometric Research*. J. Wiley, New York.

**Rogers, D.J.** and **Tanimoto, T.T.** (1960). A computer program for classifying plants. *Science 132*: 1115-1118.

**Rohlf, F.J.** (1982). Consensus indices for comparing classifications. *Math. Biosci. 59*: 131-144.

**Rohlf, F.J.** (1992). *Numerical Taxonomy and Multivariate Analysis System. Version 1.70*. Exeter Software, Setauket, NY.

**Russel, P.F.** and **Rao, T.R.** (1940). On habitat and association of species of anopheline larvae in south-eastern Madras. *J. Malaria Inst. India 3*: 153-178.

**Skroch, P., Tivang, J.** and **Nienhuis, J.** (1992). Analysis of genetic relationships using RAPD marker data. In: *Applications of RAPD Technology to Plant Breeding. Joint Plant Breeding Symposia Series*, Minneapolis, 1992. CCSA, ASHS, and AGA, Madison, pp. 26-30.

**Sneath, P.H.A.** and **Sokal, R.R.** (1973). *Numerical Taxonomy: the Principles and Practice of Numerical Classification*. W.H. Freeman, San Francisco.

**Sokal, R.R.** and **Michener, C.D.** (1958). A statistical method for evaluating systematic relationships. *Univ. Kans. Sci. Bull. 38*: 1409-1438.

**Sokal, R.R.** and **Sneath, P.H.A.** (1963). *Principles of Numerical Taxonomy*. W.H. Freeman, San Francisco.

**Sorensen, T.** (1948). A method of establishing groups of equal amplitude in plant sociology based on similarity of species content and its application to analyses of the vegetation on Danish commons. *K. Dan. Vidensk. Selsk. Biol. Skr. 5*: 1-34.

**Williams, J.G.K., Kubelik, A.R., Livak, K.J., Rafalski, J.A.** and **Tingey, S.V.** (1990). DNA polymorphisms amplified by arbitrary primers are useful as genetic markers. *Nucleic Acids Res. 18*: 6531-6535.

**(Received April 6, 1998)**