Comparison between multivariate methods applied for the evaluation of genetic divergence in Cacao (Theobroma cacao L.)

Luiz Antônio dos Santos Dias (I); Paulo Yoshio Kageyama (II)

(I) Present address: Departamento de Biologia Geral, Universidade Federal de Viçosa, CEP 36.571-000, Viçosa - MG, Brasil

(II) Centro de Pesquisas do Cacau

Address for correspondence: Luiz Antônio dos Santos Dias, Departamento de Biologia Geral, Universidade Federal de Viçosa, CEP 36.571-000, Viçosa, MG, Brasil

ABSTRACT

Several multivariate methods have been used in divergence analyses of populations. The consistency of, and relative association among, four such methods were assessed using data from a 5 x 5 complete diallel of cacao cultivars. Five yield components were evaluated over a 5-year period. For assessing the divergence of parents, only the data obtained from the five parent cultivars were analyzed. The four multivariate statistics showed close association only when considered in pairs: Mahalanobis' distance (D²) with the mean Euclidean distance obtained from canonical variates (dcv), and the mean Euclidean distance (de) with the mean Euclidean distance obtained from principal components (dpc). In both cases, high correlations (r > 0.95) were obtained. However, a weak association was detected between D² and de and between dpc and dcv (0.50 and 0.66, respectively). Thus, in studies on genetic divergence, statistics that consider the error variance-covariance matrix should be preferred whenever it can be estimated.

Key words: cacao cultivars, canonical variates, genetic divergence, Mahalanobis' and mean Euclidean distances, principal components

Introduction

Multivariate methods have proven adequate for evaluating divergence between parents and for predicting promising crosses between them in several crops (Dias and Kageyama, 1997a). Such methods are also applied to optimize germplasm collections by evaluating the divergence between accessions (Dias et al., 1997). These applications can rely on characters traditionally evaluated in cacao trials and involve no additional field costs. Dias and Kageyama (1997a) were able to associate the mean and heterotic performance of hybrids with the genetic divergence among five cacao parent cultivars, estimated by Mahalanobis' distance (D²). Because an optimum environment was shown to be important for better expressing divergence by D², genetic divergence may be assessed based on a single favourable year (Dias and Kageyama, 1997b). This strategy is predictive in nature because it avoids making and evaluating hundreds of undesirable crosses. Only the crosses predicted to be promising are made, which saves financial resources, time, and labour.

Several multivariate methods are available for evaluating divergence between populations (Van Laar, 1991). The most common are the Mahalanobis' and Euclidean distances. The Euclidean distance, however, may be applied to the original or standardized mean data, to the scores of the first principal components, or to the scores of the first canonical variates. It is important that these statistics be consistent with each other, showing a close association. Such consistency arises fundamentally when the different statistics indicate the same most distant and most similar pairs of cultivars. When this occurs, any of these multivariate statistics can be applied in studies of divergence. The goal of this study was to compare the four multivariate statistics mentioned above in evaluating divergence among five cacao cultivars. The characters contributing least to divergence were also identified using both canonical variates and principal components analyses.

Materials and Methods

Plant materials: The present study employed data from a 5 x 5 complete diallel in which five non-commercial cacao cultivars were tested together with their 20 possible hybrids. A detailed combining ability analysis and a description of this diallel were provided by Dias and Kageyama (1995). In assessing the divergence of parents, only the data obtained from the five cacao cultivars (CC 41, SIAL 169, CEPEC 1, ICS 1 and SIC 19) were used. The cultivars were evaluated over a 5-year period for five yield components: the number of healthy and of collected fruits per plant (NHFP and NCFP), the weight of humid seeds per plant in kg (WHSP) and per fruit in g (WHSF), and the percentage of diseased fruits per plant (PDFP).

Multivariate analyses: In order to calculate Mahalanobis' distance and canonical variates, the joint analysis of variance and covariance, carried out over the set of five cultivars, was used. Thus, the underlying analysis of variance and covariance structure included the following matrices of sums of squares (SS) and products (SP):

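[Table of the SS and SP matrices, not reproduced here; it defined the between-cultivars matrix B and the pooled error matrix W.]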

The D² statistic of Mahalanobis' distance (see Rao, 1952) between two cultivars on p characters is defined as:

$$D^2 = d' W^{-1} d$$

where d is the vector of differences between the cultivar means for all p characters and d' is its transpose. W is the p x p pooled error variance-covariance matrix obtained from the joint analysis of variance. Canonical variates are linear combinations of the p characters whose coefficients are the elements of the eigen-vectors associated with the eigen-values of the matrix W⁻¹B, where B is the estimated between-cultivars variance-covariance matrix and W is the estimated pooled error variance-covariance matrix (both from the analysis of variance and covariance structure described above). By definition, the largest eigen-value and the coefficients of its associated eigen-vector produce the first canonical variate, which corresponds to the best linear function. The next-largest eigen-value produces the second-best linear function, and so forth.
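The two definitions above can be sketched numerically. The following is a minimal illustration, not the authors' computation: the cultivar means and the matrix W are hypothetical values, and B is estimated here simply as the covariance of the cultivar means, which is an assumption about scaling. The coefficient vectors are scaled so that V'WV = I; under that scaling, and when all p canonical variates are kept, the squared Euclidean distance between scores reproduces D².

```python
import numpy as np

# Hypothetical means of 3 cultivars (rows) on p = 2 characters (columns).
means = np.array([[10.0, 4.0],
                  [12.0, 5.5],
                  [9.0, 6.0]])
# Hypothetical pooled error variance-covariance matrix W (p x p).
W = np.array([[2.0, 0.8],
              [0.8, 1.0]])


def mahalanobis_d2(xi, xj, W):
    """Return D^2 = d' W^{-1} d for the difference d between two mean vectors."""
    d = xi - xj
    return float(d @ np.linalg.solve(W, d))


def canonical_variates(means, W):
    """Eigen-values of W^{-1} B and coefficient vectors V scaled so V' W V = I."""
    B = np.cov(means, rowvar=False)            # assumed between-cultivars matrix
    L_inv = np.linalg.inv(np.linalg.cholesky(W))
    # W^{-1} B shares its eigen-values with the symmetric matrix L^{-1} B L^{-T}.
    eigvals, U = np.linalg.eigh(L_inv @ B @ L_inv.T)
    order = np.argsort(eigvals)[::-1]          # largest eigen-value first
    V = L_inv.T @ U[:, order]
    return eigvals[order], V


eigvals, V = canonical_variates(means, W)
scores = means @ V                             # canonical variate scores
d2 = mahalanobis_d2(means[0], means[1], W)
print(d2, np.sum((scores[0] - scores[1]) ** 2))
```

Keeping only the first k variates, as done in the study, gives an approximation whose quality depends on how much variation those variates accumulate.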

The characters contributing least to divergence were identified using Singh's (1981) criterion applied to canonical variates. By this criterion, the characters of minor importance, rejected as redundant, were those associated with the largest coefficients of the eigen-vectors corresponding to the smallest eigen-values. These coefficients were first standardized by multiplying them by the standard deviation of the corresponding character, obtained from the pooled error variance, and the results were named standardized weights. Singh's (1981) criterion was also applied to obtain the percentage contribution of each character to the overall divergence measured by D². In both cases, the pooled error variance-covariance matrix W was transformed to the identity matrix, with unit variances and zero covariances (see Rao, 1952). Thus, the set of original variates Xi was transformed into a set of uncorrelated variables Yi.
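A percentage contribution of each character to the overall D² can be sketched as follows. This assumes one common formulation of Singh's criterion, in which the share of character j in a pairwise D² is d_j multiplied by the j-th element of W⁻¹d, summed over all cultivar pairs; the paper does not spell out the formula, and all data below are hypothetical.

```python
import numpy as np
from itertools import combinations


def singh_contributions(means, W):
    """Percentage share of each character in the total D^2 over all pairs."""
    W_inv = np.linalg.inv(W)
    totals = np.zeros(means.shape[1])
    for i, k in combinations(range(means.shape[0]), 2):
        d = means[i] - means[k]
        # d_j * (W^{-1} d)_j sums over j to the pairwise D^2 = d' W^{-1} d.
        totals += d * (W_inv @ d)
    return 100.0 * totals / totals.sum()


# Hypothetical means (4 cultivars x 3 characters) and pooled error matrix.
means = np.array([[10.0, 4.0, 1.2],
                  [12.0, 5.5, 1.0],
                  [9.0, 6.0, 1.6],
                  [11.0, 4.5, 1.1]])
W = np.array([[2.0, 0.5, 0.2],
              [0.5, 1.0, 0.1],
              [0.2, 0.1, 1.5]])

pct = singh_contributions(means, W)
print(np.round(pct, 1))
```

Under this formulation an individual share can be negative when characters are strongly correlated, so the percentages are best read as a ranking aid rather than as variance components.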

A principal component is likewise defined as a linear combination of the p characters, such that the total variance is preserved and redistributed among the components. The first principal component has the largest variance; the second, orthogonal to the first, has the second-largest variance, and so forth. Maximizing the variance function by means of Lagrange multipliers and differentiation yields a polynomial of degree p whose latent roots are the p eigen-values of the covariance matrix between the characters. As the p characters were measured in different units, standardization was necessary. Standardization provided homogeneity of variance and gave every character an equal chance to contribute to the divergence. Characters standardized to unit variance generated a p x p correlation matrix for principal components analysis (PCA). In this case, the total variance was equal to p, since the correlation matrix was used for PCA.
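As a minimal sketch of PCA on a correlation matrix, the snippet below uses hypothetical cultivar means (not the study's data) to show that the eigen-values sum to p and that the variances of the component scores equal the eigen-values.

```python
import numpy as np

# Hypothetical means: 5 cultivars x 3 characters measured in different units.
X = np.array([[12.0, 0.9, 310.0],
              [15.0, 1.4, 280.0],
              [9.0, 0.7, 350.0],
              [14.0, 1.1, 300.0],
              [11.0, 1.0, 330.0]])

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # unit-variance characters
R = np.corrcoef(X, rowvar=False)                  # p x p correlation matrix

eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]                 # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Z @ eigvecs                              # principal component scores
explained = np.cumsum(eigvals) / eigvals.sum()    # cumulative share of variance
print(eigvals.sum(), np.round(explained, 3))
```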

Inferences about redundant characters were made by applying Jolliffe's (1972, 1973) criterion to the PCA. By this criterion, the characters rejected as redundant were those associated with the largest coefficients of the eigen-vectors corresponding to the smallest eigen-values. The number of characters rejected equals the number of eigen-values of the correlation matrix smaller than about 0.70. This limit was not applied in our analysis because it would have retained too few characters.
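The discarding rule described above can be sketched as a short procedure: starting from the smallest eigen-value, reject the not-yet-rejected character with the largest absolute coefficient, and stop at the first eigen-value that reaches the threshold. The function name and the correlation matrix are hypothetical, and the handling of already-rejected characters is an assumption of this sketch.

```python
import numpy as np


def jolliffe_discard(R, threshold=0.70):
    """Indices of characters rejected as redundant from correlation matrix R."""
    eigvals, eigvecs = np.linalg.eigh(R)      # eigen-values in ascending order
    discarded = []
    for lam, vec in zip(eigvals, eigvecs.T):
        if lam >= threshold:
            break                             # remaining eigen-values are larger
        weights = np.abs(vec)
        for j in discarded:
            weights[j] = -1.0                 # skip characters already rejected
        discarded.append(int(np.argmax(weights)))
    return discarded


# Hypothetical correlation matrix: characters 0 and 1 are nearly redundant.
R = np.array([[1.00, 0.95, 0.30],
              [0.95, 1.00, 0.10],
              [0.30, 0.10, 1.00]])

print(jolliffe_discard(R))
```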

The mean Euclidean distance between two cultivars i and i', computed over the p characters, was defined as:

$$d_e = \sqrt{\frac{1}{p}\sum_{j=1}^{p}\left(x_{ij} - x_{i'j}\right)^2}$$

where p was the number of characters analyzed and xij = Xij/sj, with Xij being the mean of the j-th character measured in the i-th cultivar and sj the standard deviation of that character. With the standardization and the multiplication by 1/p, the Euclidean distance was affected neither by the different scales of measurement of the characters nor by their number. The xij could also be replaced by the scores of the principal components or canonical variates obtained from standardized data, with p replaced by k, the number of principal components or canonical variates, as in the expressions below:

$$d_{pc} = \sqrt{\frac{1}{k}\sum_{j=1}^{k}\left(pc_{ij} - pc_{i'j}\right)^2}$$

where pcij was the score of the i-th cultivar on the j-th principal component obtained from the correlation matrix between the characters, and

$$d_{cv} = \sqrt{\frac{1}{k}\sum_{j=1}^{k}\left(cv_{ij} - cv_{i'j}\right)^2}$$

where cvij was the score of the i-th cultivar on the j-th canonical variate.
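A minimal sketch of the mean Euclidean distance follows, assuming the mean of squared differences is taken inside the square root and that sj is the standard deviation of the cultivar means; both the data and that choice of sj are assumptions. The same function applies to principal component or canonical variate scores by passing the first k score columns.

```python
import numpy as np


def mean_euclidean(values):
    """Pairwise sqrt((1/n) * sum_j (v_ij - v_i'j)^2) over the n columns."""
    diff = values[:, None, :] - values[None, :, :]
    return np.sqrt((diff ** 2).mean(axis=2))


# Hypothetical means: 4 cultivars x 3 characters in different units.
X = np.array([[12.0, 0.9, 310.0],
              [15.0, 1.4, 280.0],
              [9.0, 0.7, 350.0],
              [14.0, 1.1, 300.0]])

s = X.std(axis=0, ddof=1)         # assumed definition of s_j
d_e = mean_euclidean(X / s)

# Changing the units of the characters leaves the standardized distance as is.
X_other_units = X * np.array([1000.0, 1.0, 0.01])
d_e_other = mean_euclidean(X_other_units / X_other_units.std(axis=0, ddof=1))
print(np.round(d_e, 3))
```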

Calculating divergence by multivariate statistics: The four multivariate statistics were used to quantify divergence between pairs of cacao cultivars. The mean Euclidean distance was applied to the standardized original data, to the scores of the first two principal components, and to the scores of the first two canonical variates (Van Laar, 1991). The mathematical relationship between Euclidean distance and principal components analysis has been stressed by Dias (1998). Both sets of scores were obtained from standardized original data. Mahalanobis' distance was also used to quantify divergence (see Rao, 1952). The degree of association between the divergence estimates given by the multivariate statistics was quantified in three ways: by the statistical relationship among them; by a test of equality between the pooled error correlation matrix and the identity matrix (Godoi, 1985); and by the magnitude of Spearman's rank correlation coefficient (r) between pairs of statistics.
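Spearman's rank correlation between two sets of pairwise distances can be computed as the Pearson correlation of their ranks. The sketch below uses only the standard library; the two distance lists are hypothetical values for the 10 pairs formed by five cultivars, not the estimates of Table 1.

```python
def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average = (pos + end) / 2 + 1
        for i in order[pos:end + 1]:
            result[i] = average
        pos = end + 1
    return result


def spearman(a, b):
    """Spearman's rank correlation: Pearson correlation of the two rank lists."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


# Hypothetical distance estimates for the 10 cultivar pairs.
stat_one = [4.1, 2.0, 6.3, 1.5, 3.2, 5.8, 2.7, 1.1, 3.9, 2.2]
stat_two = [1.9, 0.7, 1.2, 0.8, 2.4, 2.2, 1.0, 0.9, 1.6, 1.4]
print(round(spearman(stat_one, stat_two), 2))
```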

Results and Discussion

Comparing multivariate statistics: The estimates of Mahalanobis' distance (D²), of the mean Euclidean distance of standardized data (de), and of the mean Euclidean distance of the scores of the first two principal components (dpc) and of the first two canonical variates (dcv), for all pairs of the five cacao cultivars, are shown in Table 1. The first two components and the first two variates accounted for 89.4% and 92.1%, respectively, of the total variation (see Table 2). The association among these estimates was first investigated using the data of Table 1. The D² and dcv statistics identified cultivar pairs 1, 4 and 2, 4 as the most divergent and pairs 1, 5 and 3, 5 as the most similar. Based on de and dpc, the most divergent pairs were 2, 3 and 2, 4, and the most similar were 1, 5 and 3, 4. These results revealed a certain discrepancy among the statistics applied.
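The comparison of most divergent and most similar pairs can be sketched as a simple ranking of pairwise distances. The values below are hypothetical and cover only four cultivars; they are not the estimates of Table 1.

```python
def rank_pairs(distances):
    """Return cultivar pairs sorted from most to least divergent."""
    return sorted(distances, key=distances.get, reverse=True)


# Hypothetical distance estimates for four cultivars under two statistics.
stat_a = {(1, 2): 4.1, (1, 3): 2.0, (1, 4): 6.3,
          (2, 3): 3.2, (2, 4): 5.8, (3, 4): 1.1}
stat_b = {(1, 2): 1.9, (1, 3): 0.7, (1, 4): 1.2,
          (2, 3): 2.4, (2, 4): 2.2, (3, 4): 0.9}

for name, stat in (("A", stat_a), ("B", stat_b)):
    order = rank_pairs(stat)
    print(name, "most divergent:", order[:2], "most similar:", order[-2:])

# Two statistics are consistent in this sense when the extreme pairs coincide.
agree = set(rank_pairs(stat_a)[:2]) == set(rank_pairs(stat_b)[:2])
print("same most divergent pairs:", agree)
```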

The ratio between $k\sum d_{pc}^2$ and $p\sum d_e^2$, with sums taken over all cultivar pairs, was another approach for assessing the mean degree of association between the statistics dpc and de. In turn, the association between the statistics dcv and D² was quantified by the ratio between $k\sum d_{cv}^2$ and $\sum D^2$. Both ratios were calculated from Table 1. Thus, with k = 2 as the number of principal components or canonical variates used in calculating the distances, and p = 5 as the number of evaluated characters, the statistical relationship between dpc and de provided a degree of association of 89.4%. The ratio between the mean Euclidean distances estimated from canonical variates and the sum of Mahalanobis' distances was 92.1%, which demonstrated a higher degree of association between these last two statistics.
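One reading of the ratio between dpc and de, assumed here, is the sum over all pairs of squared distances on the first k principal component scores divided by the same sum over all p standardized characters. Under that reading the ratio equals the share of total variance accumulated by the first k components, which is consistent with the 89.4% reported for both quantities. The data below are hypothetical.

```python
import numpy as np
from itertools import combinations

# Hypothetical means: 5 cultivars x 3 characters.
X = np.array([[12.0, 0.9, 310.0],
              [15.0, 1.4, 280.0],
              [9.0, 0.7, 350.0],
              [14.0, 1.1, 300.0],
              [11.0, 1.0, 330.0]])

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(X, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
scores = Z @ eigvecs


def sum_squared_distances(values):
    """Sum over all pairs of rows of the squared Euclidean distance."""
    pairs = combinations(range(values.shape[0]), 2)
    return sum(float(np.sum((values[i] - values[j]) ** 2)) for i, j in pairs)


k, p = 2, X.shape[1]
ratio = sum_squared_distances(scores[:, :k]) / sum_squared_distances(Z)
explained = eigvals[:k].sum() / p
print(round(ratio, 4), round(explained, 4))
```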

The close association among the distance estimates was also assessed by evaluating the pooled error correlation matrix among the studied characters (Table 2). This correlation matrix is taken into account only by Mahalanobis' distance and by the mean Euclidean distance obtained from canonical variates. This difference allowed the comparison of Mahalanobis' distance with the mean Euclidean distance, and of the mean Euclidean distance from principal components with that obtained from canonical variates. However, the null hypothesis (H0) of equality between the pooled error correlation matrix and the identity matrix was not rejected (χ² = 14.66, P > 0.05, with 10 d.f.), although this matrix presented three correlation estimates of high magnitude among its 10 correlations (see Table 2).
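The reported statistic can be checked against the chi-square distribution. The sketch below relies on the closed-form upper-tail probability that holds for an even number of degrees of freedom, and on the fact that a p x p correlation matrix has p(p-1)/2 distinct correlations, which gives 10 d.f. for p = 5. The function name is hypothetical.

```python
import math


def chi2_upper_tail_even_df(x, df):
    """Upper-tail probability of a chi-square variable, exact for even df."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    series = sum(half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * series


p = 5                                   # number of characters
df = p * (p - 1) // 2                   # distinct correlations tested
p_value = chi2_upper_tail_even_df(14.66, df)
print(df, round(p_value, 3))
```

The resulting probability lies between 0.14 and 0.15, above the 0.05 level, which agrees with H0 not being rejected.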

The degree of association of the different distance estimates was also evaluated by the correlation between them (Table 1). The correlations were high between Mahalanobis' distance and the Euclidean distance obtained from canonical variates (r = 0.95) and between the mean Euclidean distance and the Euclidean distance estimated from principal components (r = 0.97). However, the correlations between D² and de, D² and dpc, de and dcv, and dpc and dcv were only moderate, which confirmed the discrepancy detected among the statistics used to quantify divergence among cultivars. Maluf and Ferreira (1983) also found a low correlation between the Euclidean and Mahalanobis' distances (r = 0.27). However, similar divergence patterns between D² and the Euclidean distance (Maluf et al., 1983) and between D² and canonical variates (Ramanujam et al., 1974; Narayan and Macefield, 1976; Jain et al., 1981; Das and Das Gupta, 1984) have been reported. Likewise, Hussaini et al. (1977) and Calamassi et al. (1988) detected close association between divergence analyses conducted by principal components and by canonical variates. A high degree of association among the four statistics used in this study was reported by Cruz et al. (1994) and Pires (1993). These literature results, obtained with different species, demonstrate the relative consistency of the multivariate techniques and support, with certain restrictions, the application of any one of them in estimating genetic divergence between populations.

However, our study did not corroborate these results. The four multivariate statistics showed close association only when considered in pairs: Mahalanobis' distance (D²) with the Euclidean distance obtained from canonical variates (dcv), and the mean Euclidean distance (de) with the Euclidean distance obtained from principal components (dpc). The Euclidean distance is a Pythagorean distance extended to multiple orthogonal axes, and therefore presupposes independence between the characters analyzed (Sneath and Sokal, 1973). Since the characters in our study were intercorrelated, particularly NCFP, NHFP and WHSP, the application of Euclidean distance could be inadequate. The same limitation applied when the Euclidean distance was calculated from the scores of the first principal components, and the same result was obtained. In fact, these two measures of distance were closely associated with each other (r = 0.97 in Table 1), whereas neither was associated with the two remaining distance measures (Table 1).
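The independence assumption can be illustrated with a small sketch: when the error matrix is the identity, D² coincides with the squared Euclidean distance, whereas correlated errors change how a given difference is weighted. The difference vectors and the correlation of 0.9 are hypothetical.

```python
import numpy as np


def d2(d, W):
    """Mahalanobis D^2 = d' W^{-1} d for a difference vector d."""
    return float(d @ np.linalg.solve(W, d))


independent = np.eye(2)
correlated = np.array([[1.0, 0.9],
                       [0.9, 1.0]])     # hypothetical error correlation

along = np.array([1.0, 1.0])            # difference along the correlation
against = np.array([1.0, -1.0])         # difference against the correlation

for d in (along, against):
    print(float(d @ d), d2(d, independent), round(d2(d, correlated), 3))
```

Both difference vectors have the same squared Euclidean length, yet with correlated characters the difference that runs against the correlation is weighted far more heavily than the one that runs along it.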

Essentially, it was the error variance-covariance matrix that made the difference, as it was considered in Mahalanobis' distance and canonical variates but not in Euclidean distance and PCA. According to Van Laar (1991), measures of distance that consider all p variances as well as all p(p-1)/2 covariances (such as Mahalanobis' distance and the Euclidean distance obtained from canonical variates) are more suitable for population divergence studies than the plain Euclidean distance. The discrepancy found among the statistics constitutes an additional complication. This is especially true in studies on divergence between accessions kept in germplasm banks or even under natural conditions, where there is usually no replication. In these cases, however, the error variance-covariance matrix can probably be obtained from the variation among plants within accessions. Nevertheless, where only the means of accessions are available, applying the Euclidean distance to the original or standardized data and to the scores of principal components is the only feasible solution, although the chance of a successful analysis is reduced.

Importance of Characters for Divergence: It was possible to quantify the implications of including redundant characters in this study. The topic has been treated in the light of the relative importance of different characters for divergence and of the possibility of discarding them without distorting the distance matrix. In order to identify the most important characters in the determination of genetic divergence, Singh's (1981) criterion was applied to the Mahalanobis' distance matrix. The relative contributions of the characters, as percentages of the overall D², were 38.6% for NCFP, 32.5% for WHSF, 16.9% for NHFP, 8.8% for WHSP and 3.2% for PDFP. When applied to canonical variates (see Table 2), Singh's criterion identified NHFP and WHSP as being of minor importance to divergence. For principal components, the characters of minor importance for divergence were NCFP and WHSP (as can be seen in Table 1) when Jolliffe's (1972, 1973) criterion was employed. Hence, the different rejection criteria indicated precisely those characters known to be redundant.

Where certain characters are highly correlated, they are regarded as redundant and should be excluded from the analysis. Thus, if NHFP and WHSP were discarded, these characters would be represented by NCFP (phenotypic correlations of 0.99 and 0.93, respectively, with NCFP). In practice, when NHFP and WHSP were discarded simultaneously, no distortion was observed in the Mahalanobis' distance matrix or in the Euclidean distance matrix obtained from canonical variates (Table 3), since these distances are scale-invariant. Likewise, including the character stand in the analysis did not alter the Mahalanobis' distance matrix (data not shown), which again evidenced the robustness of Mahalanobis' distance. According to Singh (1981), Mahalanobis' distance must not be distorted when additional characters are considered. These facts revealed the robustness of these two multivariate techniques in estimating divergence.
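The scale invariance mentioned above can be verified directly: changing the units of the characters transforms both the mean differences and W, and D² is unchanged, while the Euclidean distance on unstandardized data is not. The means, W and unit changes below are hypothetical.

```python
import numpy as np


def mahalanobis_d2(xi, xj, W):
    """D^2 = d' W^{-1} d for the difference between two mean vectors."""
    d = xi - xj
    return float(d @ np.linalg.solve(W, d))


# Hypothetical means of two cultivars and pooled error matrix.
xi = np.array([10.0, 4.0])
xj = np.array([12.0, 5.5])
W = np.array([[2.0, 0.8],
              [0.8, 1.0]])

# Hypothetical change of units, e.g. kg to g for one character.
A = np.diag([1000.0, 0.01])

d2_original = mahalanobis_d2(xi, xj, W)
d2_rescaled = mahalanobis_d2(A @ xi, A @ xj, A @ W @ A.T)

euclid_original = float(np.sum((xi - xj) ** 2))
euclid_rescaled = float(np.sum((A @ xi - A @ xj) ** 2))
print(d2_original, d2_rescaled, euclid_original, euclid_rescaled)
```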

Jolliffe (1972, 1973) demonstrated that results obtained from real and artificial data analyzed by principal components changed little when variables previously known to be redundant were discarded. In this study, when the discarding of both NHFP and WHSP was simulated, the mean Euclidean distance matrix and the mean Euclidean distance matrix obtained from principal components did change (see Table 3). In these cases, the de and dpc statistics identified cultivar pairs 1, 4 and 2, 4 as the most divergent and only pair 1, 5 as the most similar, repeating part of the results obtained with the D² and dcv statistics. Hence, the presence of redundant characters influenced the de and dpc statistics and distorted the distance matrix, since these statistics are not robust enough to withstand a high degree of redundancy.

Received: 17 March 1998

Revised: 14 May 1998

Accepted: 07 August 1998

References

  • Calamassi, R.; Puglisi, S.R. & Vendramin, G.G. (1988), Genetic variation in morphological and anatomical needle characteristics in Pinus brutia Ten. Silvae Genet., 37, 199-206
  • Cruz, C.D.; Vencovsky, R. & Carvalho, S.P. (1994), Estudos sobre divergência genética. III. Comparação de técnicas multivariadas. Rev. Ceres, 41, 191-201
  • Das, P.K. & Das Gupta, T. (1984), Multivariate analysis in black gram [Vigna mungo (L.) Hepper]. The Indian J. Genet. Plant Breed., 44, 243-247
  • Dias, L.A.S. (1998), Análises multidimensionais. In: Eletroforese de Isoenzimas e Proteínas Afins: fundamentos e aplicações em plantas e microrganismos, (Ed.) A. C. Alfenas, Editora UFV, Viçosa, pp. 405-475
  • Dias, L.A.S. & Kageyama, P.Y. (1995), Combining-ability for cacao (Theobroma cacao L.) yield components under southern Bahia conditions. Theor. Appl. Genet., 90, 534-541
  • Dias, L.A.S. & Kageyama, P.Y. (1997a), Multivariate genetic divergence and hybrid performance of cacao (Theobroma cacao L.). Brazil. J. Genetics, 20, 63-70
  • Dias, L.A.S. & Kageyama, P.Y. (1997b), Temporal stability of multivariate genetic divergence in cacao (Theobroma cacao L.) in Southern Bahia conditions. Euphytica, 93, 181-187
  • Dias, L.A.S.; Kageyama, P.Y. & Castro, G.C.T. (1997), Divergência fenética multivariada na preservação de germplasma de cacau (Theobroma cacao L.). Agrotrópica, 9, 29-40
  • Godoi, C.R.M. (1985), Análise Estatística Multidimensional. Departamento de Matemática, ESALQ, USP, Piracicaba
  • Hussaini, S.H.; Goodman, M.M. & Timothy, D.H. (1977), Multivariate analysis and the geographical distribution of the world collection of finger millet. Crop Sci., 17, 257-263
  • Jain, A.K.; Dhagat, N.K. & Tiwari, A.S. (1981), Genetic divergence in finger millet. The Indian J. Genet. Plant Breed., 41, 346-348
  • Jolliffe, I.T. (1972), Discarding variables in a principal component analysis. I. Artificial data. Appl. Stat., 21, 160-173
  • Jolliffe, I.T. (1973), Discarding variables in a principal component analysis. II. Real data. Appl. Stat., 22, 21-31
  • Maluf, W.R. & Ferreira, P.E. (1983), Análise multivariada da divergência genética em feijão-vagem (Phaseolus vulgaris L.). Brazil. Hort., 1, 31-34
  • Maluf, W.R.; Ferreira, P.E. & Miranda, J.E.C. (1983), Genetic divergence in tomatoes and its relationship with heterosis for yield in F1 hybrids. Brazil. J. Genet., 6, 453-460
  • Narayan, R.K.J. & Macefield, A.J. (1976), Adaptive responses and genetic divergence in a world germplasm collection of chick pea (Cicer arietinum L.). Theor. Appl. Genet., 47, 179-187
  • Pires, C.E.L.S. (1993), Diversidade genética de variedades de cana-de-açúcar (Saccharum spp.) cultivadas no Brasil. Tese de Doutorado, ESALQ, USP, Piracicaba
  • Ramanujam, S.; Tiwari, A.S. & Mehra, R.B. (1974), Genetic divergence and hybrid performance in mung bean. Theor. Appl. Genet., 45, 211-214
  • Rao, C.R. (1952), Advanced Statistical Methods in Biometric Research. John Wiley, New York
  • Singh, D. (1981), The relative importance of characters affecting genetic divergence. The Indian J. Genet. Plant Breed., 41, 237-245
  • Sneath, P.H.A. & Sokal, R.R. (1973), Numerical Taxonomy. Freeman, San Francisco
  • Van Laar, A. (1991), Forest Biometry. Sappi Forests, Stellenbosch

Publication in this collection: 27 July 2011
Date of issue: 1998

    Instituto de Tecnologia do Paraná - Tecpar Rua Prof. Algacyr Munhoz Mader, 3775 - CIC, 81350-010 Curitiba PR Brazil, Tel.: +55 41 3316-3052/3054, Fax: +55 41 3346-2872 - Curitiba - PR - Brazil
    E-mail: babt@tecpar.br