Trait selection using procrustes analysis for the study of genetic diversity in Conilon coffee

ABSTRACT.

Trait selection is sometimes necessary to save money and time and to accelerate breeding programs. This study proposes two criteria for selecting traits based on Procrustes analysis, a technique that is poorly explored in genetic breeding: Criterion 1 (backward algorithm) and Criterion 2 (exhaustive algorithm). These two criteria were compared with Jolliffe’s criterion, which has often been used to select traits in genetic diversity studies. Sixteen agronomic traits were evaluated in 40 Conilon coffee (Coffea canephora) accessions. The results showed that criteria 1 and 2, which combine flexibility to select traits according to researcher preference, graphical visualization, and the Procrustes M2 statistic, are a fast and reliable alternative for decision-making. These decisions concern the removal and addition of traits for phenotyping in studies of Conilon coffee diversity, and the approach can be applied to other crops. Other relevant aspects of trait selection criteria are also discussed.

Keywords:
principal components; graphical comparison; selection criteria; discard of variables; plant breeding

Introduction

Studies on genetic diversity play an important role in breeding programs because they are crucial at the initial phase, called prebreeding, in which it is possible to regenerate, characterize, explore, and promote the conservation of variability available on the base population. Moreover, the information collected at the prebreeding phase is useful in obtaining potential candidates to generate divergent parents. These parents are more likely to promote satisfactory results regarding the genetic potential of derived cultivars or lineages as well as combining their abilities to obtain heterotic hybrids.

Multivariate techniques such as discriminant analysis, principal component analysis, coordinate analysis (Cruz, Ferreira, & Pessoni, 2011), and clustering are used in this kind of study. Several inferences about genetic diversity can be made for different purposes of a breeding program using principal component analysis (PCA) (Yousaf et al., 2018; Liu et al., 2017; Muleta, Bulli, Zhang, Chen, & Pumphrey, 2017; Yano, Nonaka, & Ezura, 2018).

The genetic diversity existing among and within populations can be measured by the difference between the phenotypic values of their accessions and is obtained in field experiments using a considerable number of morphological, agronomic, and other traits of the studied cultivars. If a collection of accessions evaluated in a given experiment comes from a population or a germplasm bank, it can be re-evaluated in future studies for a variety of purposes. In some situations with a high cost and degree of difficulty involved to obtain a particular trait(s), it may be valuable to evaluate a smaller number of traits than those recorded in the germplasm bank. However, variability is a factor of extreme importance in the development of new varieties and in the conservation of genetic resources, and it is the breeder’s responsibility to investigate the extent of exclusion of one or multiple traits that will affect the present variability in the group of accessions under analysis.

The relative importance of traits in genetic diversity studies can be assessed using the criteria proposed by Singh (1981) and Jolliffe (1972). However, the choice between them depends on the method the researcher initially chose to study genetic diversity, since the two criteria follow different approaches.

The first criterion is used when diversity is evaluated from dissimilarity measures (distances between pairs of accessions) that feed a cluster analysis. The second criterion is based on principal components, which generate a graphical dispersion in two- or three-dimensional space.

In addition to the trait-discarding criteria mentioned above, there is another methodology based on Procrustes analysis. Although it is rarely used in genetic diversity studies (especially for trait selection), the Procrustes approach has been used in many different areas, including food engineering (Oliveira & Benassi, 2010; Mauricio, Palazzo, Caselato, & Bolini, 2016) and health sciences (Douglas, 2004; Daboul, Ivanovska, Bülow, Biffar, & Cardini, 2018). Its application has shown great promise and has been evaluated in several studies. However, the technique has been poorly explored in genetic breeding (Klingenberg, 2003; Bramardi, Bernet, Asíns, & Carbonell, 2005; García-Peña & Dias, 2009), which was the main motivation for this paper.

The Procrustes analysis technique allows a comparison of two configurations or two datasets as long as each row corresponds to the same individual. If two sets of vectors differ from each other but are defined in the same subspace, it is possible to estimate how much their respective graphical representations differ by means of the Procrustes M2 statistic. The smaller the value of this statistic, the more similar the two configurations are.

Krzanowski (1987) presents a methodology that combines PCA, which is used to obtain the scores of the configurations, and Procrustes analysis to determine how well a subset of traits represents the structure of the original dataset (with all traits). The author discusses Procrustes analysis from two perspectives: for selecting traits with a backward elimination algorithm that uses the Procrustes statistic as the discard criterion, and for comparing the grouping patterns of different trait subsets resulting from different selection methods using the same statistic.

Based on the strategy proposed by Krzanowski (1987) and using the Procrustes M2 statistic given by Peres-Neto and Jackson (2001), our objective is to propose two cut-off criteria, called the backward algorithm (Criterion 1) and the exhaustive algorithm (Criterion 2), for the selection of traits in the genetic diversity study of Conilon coffee (Coffea canephora). To validate the methodology, we compare both with Jolliffe’s criterion (1972), since it is considered a more efficient trait-discarding method for genetic diversity studies, providing more savings in a breeding program.

Material and methods

Database

The databases provided by Ferrão et al. (2008) refer to the means of the characteristics. The experiment used a randomized block design with six replications for PH and DCA and four replications for the other characteristics, with each plot consisting of two plants. Genotype effects were considered fixed, and analyses of variance of the characteristics were performed based on plot means according to the following model:

$Y_{ij} = \mu + G_i + \beta_j + e_{ij}$

where $Y_{ij}$ is the phenotypic value of the observation for the i-th genotype in the j-th block; $\mu$ is the overall mean of the character; $G_i$ is the effect of the i-th genotype (i = 1, 2, ..., 40); $\beta_j$ is the effect of the j-th block (j = 1, 2, ..., 4 or 6); and $e_{ij}$ is the experimental error, $e_{ij} \sim N(0, \sigma^2)$.

Sixteen agronomic traits from 40 Conilon coffee accessions were evaluated in the municipality of Sooretama, in the Brazilian state of Espírito Santo, in the year 2000. The traits evaluated were the number of days (D) between flowering and total fruit maturation; the grain yield (GY, kg ha⁻¹); the plant height (PH, cm); the average crown diameter (DCA, cm), taken at the "middle third" of the plant; the cherry and coconut dry coffee relationship (ChCo), taken in a 2 kg sample of cherry coffee and its dried weight; the cherry and green coffee relationship (ChBe), taken in a 2 kg sample of cherry coffee and its dried weight after processing; the coconut and green coffee relationship (CoBe), taken in a 2 kg sample of cherry coffee and its dried weight after processing; the coarse grain percentage (CG%); the “flat” grain percentage (FG%); the “mocha” grain percentage (MG%); the grain moisture percentage (GM%); the percentage of grains retained on the sieve mesh size #17 (S17); the percentage of grains retained on the sieve mesh size #15 (S15); the percentage of grains retained on the sieve mesh size #13 (S13); the percentage of grains retained on the sieve mesh size #11 (S11); and the medium strainer (MS) (medium grain size). According to Ferrão et al. (2008), the coefficients of experimental variation (CVe, in percent) of the characteristics are D (0.05), GY (23.24), PH (5.29), DCA (6.75), ChCo (6.85), ChBe (5.38), CoBe (7.26), CG (65.93), FG (5.20), MG (32.90), GM (11.20), S17 (31.95), S15 (11.17), S13 (15.95), S11 (51.52), and MS (2.16). Most of these values are below 30%, which indicates good experimental precision for the coffee crop (Bonomo et al., 2004; Ferrão et al., 2008; Rodrigues, Brinate, Martins, Colodetti, & Tomaz, 2017).

Procrustes analysis

To contextualize the criteria proposed in this work, it is important to first present pertinent information about Procrustes analysis. This technique allows the comparison of two datasets or two configurations, as long as each row corresponds to the same individual. If there are two sets of vectors that differ from one another but that define the same subspace, this technique allows the user to measure the difference between their respective graphical representations by means of the Procrustes M2 statistic. When the comparison is performed for more than two datasets or configurations, it is called a generalized Procrustes analysis.

For the understanding of the technique, consider the triangles Y: A-B-C and Z̃: a-b-c as the representation of two configurations in a two-dimensional space (matrices of n = 3 individuals and p = 2 traits) with different size, location, and orientation (Figure 1a).

Figure 1
Representation of steps involved in a Procrustes analysis: (a) original configurations where the triangle ABC was used as reference configuration; (b) configurations after standardization (i.e., similar size and common center); (c) configurations after mirror reflection, if necessary; (d) configuration after rotation so that the sum of the squared differences between homologous observations (A/a, B/b, C/c) is a minimum (Peres-Neto & Jackson, 2001).

The difference between these configurations is obtained by means of a Procrustes analysis so that its corresponding points align as well as possible. Procrustes analysis is a procedure that minimizes the trace of the sum of squared differences between two configurations (i.e., two data matrices) in a multivariate Euclidean space (Equation [1]), which is obtained in two steps by adjusting the configuration Z̃ to a reference configuration Y.

$\min \operatorname{trace}\left[(Y-\tilde{Z})(Y-\tilde{Z})'\right]$ Equation [1]

First, centering (translation) and scaling (dilation) are carried out on Y and Z̃ (Figure 1b), such that $Y = (I-P)Y / \sqrt{\operatorname{tr}[(I-P)YY'(I-P)]}$ and $\tilde{Z} = (I-P)\tilde{Z} / \sqrt{\operatorname{tr}[(I-P)\tilde{Z}\tilde{Z}'(I-P)]}$, where I is an n × n identity matrix and P is an n × n matrix with all elements equal to 1/n. This is followed by reflection (Figure 1c), if necessary, and rotation of Z̃ (Figure 1d) to fit it to Y. That is, Z̃ is rotated to Z̃Q', where Q = VU' is the rotation matrix, UΣV' is the singular value decomposition of Z̃'Y, Σ is a diagonal matrix, and U and V are orthogonal matrices. Finally, the statistic M2 (Equation [2]) results from the comparison between Y and Z̃; it is referred to as the Procrustes statistic or residual sum of squares and ranges from zero to infinity.

$M^2 = \operatorname{trace}\{YY' + \tilde{Z}\tilde{Z}' - 2\tilde{Z}Q'Y'\} = \operatorname{trace}\{YY' + \tilde{Z}\tilde{Z}' - 2\Sigma\}$ Equation [2]

According to Peres-Neto and Jackson (2001), the M2 statistic can be restricted to the interval between 0 and 1 using the following transformation:

$M^2 = 1 - (\operatorname{trace}\,\Sigma)^2$ Equation [3]

Procrustes analysis, Procrustes transformation or Procrustes rotation give us the idea that the configurations should be as close as possible (in the same subspace) to compare them. Thus, the configurations under the same referential can be fairly compared, and the "real difference" between them can be quantified.

When Procrustes analysis is performed on the same configuration (Y = Z̃), we have M2 = 0, indicating a perfect fit. Thus, the smaller the value of the M2 statistic, the more similar the configurations (García-Peña & Dias, 2009).
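The steps described in this section (translation, dilation, rotation or reflection, and the bounded statistic of Equation [3]) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the GENES implementation: the function name `procrustes_m2` and the triangle coordinates are assumptions made for the example, and the rotation is handled implicitly through the singular values of Z̃'Y.

```python
import numpy as np


def procrustes_m2(Y, Z):
    """Bounded Procrustes statistic M2 = 1 - (trace Sigma)^2 (hypothetical helper)."""
    Y = np.asarray(Y, dtype=float)
    Z = np.asarray(Z, dtype=float)
    # Translation: move both configurations to a common centre.
    Yc = Y - Y.mean(axis=0)
    Zc = Z - Z.mean(axis=0)
    # Dilation: scale each configuration to unit sum of squares.
    Yc = Yc / np.linalg.norm(Yc)
    Zc = Zc / np.linalg.norm(Zc)
    # Singular values of Z'Y; their sum is the trace of Sigma after the
    # optimal rotation/reflection has been applied.
    sigma = np.linalg.svd(Zc.T @ Yc, compute_uv=False)
    return 1.0 - sigma.sum() ** 2


# Illustrative triangle A-B-C and a rotated, enlarged, shifted copy a-b-c.
Y = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 1.0]])
theta = np.pi / 3
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta), np.cos(theta)]])
Z = 3.0 * Y @ R + np.array([5.0, -2.0])
# A triangle with a genuinely different shape.
W = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]])

print(procrustes_m2(Y, Y))  # same configuration: zero up to rounding
print(procrustes_m2(Y, Z))  # same shape, different size/position: zero up to rounding
print(procrustes_m2(Y, W))  # different shape: strictly positive
```

Because both configurations are scaled to unit sum of squares, the sum of the singular values cannot exceed 1, which is what keeps this version of M2 between 0 and 1.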

Trait selection criteria

To reduce the number of agronomic traits in the Conilon coffee dataset, we initially selected a subset of traits using Procrustes analysis according to the methodology presented by Krzanowski (1987). This methodology combines principal component analysis (PCA) and Procrustes analysis (Figure 2) to determine how well a subset of traits represents the structure of the set of p original traits. After PCA is performed on the matrix of original traits $X_{n \times p}$ and on the matrix of the subset of q traits $\tilde{X}_{n \times q}$, the configurations $Y_{n \times k}$ and $\tilde{Z}_{n \times k}$ to be compared are given by the scores of the respective data matrices. Thus, if the true dimensionality of the data is k, then $Y_{n \times k}$ is the true configuration and $\tilde{Z}_{n \times k}$ is the corresponding approximation based on only q traits. The difference between the two configurations was calculated by the statistic M2 from the differences between their corresponding scores. The loss of information due to the exclusion of (p-q) traits is the residue produced when only the q traits are used instead of all p traits.

Figure 2
Diagram illustrating the Procrustes analysis of data according to Krzanowski (1987).

In most areas, the choice of k has generally been based on taking the first k principal components that explain as much of the total variance as possible, thereby retaining as much of the information contained in the original dataset as possible. Here, a fixed value of k = 2 was used in order to compare two-dimensional graphical dispersions, which are very useful in genetic breeding. Therefore, the M2 statistic characterizes the disagreement between the two graphical representations, based on the distance between accessions presented on a single 2D chart.
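The combination of PCA and Procrustes analysis with k = 2 can be sketched as follows. The sketch makes several assumptions: the helper names (`pca_scores`, `procrustes_m2`, `subset_m2`) are hypothetical, the traits are standardized before PCA, and the data matrix is synthetic and only stands in for an accessions-by-traits table (it is not the Conilon coffee dataset).

```python
import numpy as np


def pca_scores(X, k=2):
    """Scores on the first k principal components of the standardized data."""
    X = np.asarray(X, dtype=float)
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # SVD of the standardized matrix: scores are U times the singular values.
    U, s, _ = np.linalg.svd(Xs, full_matrices=False)
    return U[:, :k] * s[:k]


def procrustes_m2(Y, Z):
    """Bounded Procrustes statistic, M2 = 1 - (trace Sigma)^2."""
    Yc = Y - Y.mean(axis=0)
    Zc = Z - Z.mean(axis=0)
    Yc = Yc / np.linalg.norm(Yc)
    Zc = Zc / np.linalg.norm(Zc)
    sigma = np.linalg.svd(Zc.T @ Yc, compute_uv=False)
    return 1.0 - sigma.sum() ** 2


def subset_m2(X, cols, k=2):
    """M2 between the k-dimensional scores of all traits and of a subset."""
    X = np.asarray(X, dtype=float)
    Y = pca_scores(X, k)                    # configuration from all p traits
    Z = pca_scores(X[:, list(cols)], k)     # configuration from q traits
    return procrustes_m2(Y, Z)


# Synthetic stand-in: 40 "accessions", 6 "traits" driven by 2 latent factors.
rng = np.random.default_rng(1)
latent = rng.normal(size=(40, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.3 * rng.normal(size=(40, 6))

print(subset_m2(X, range(6)))    # all traits kept: zero up to rounding
print(subset_m2(X, [0, 2, 4]))   # three traits kept: some loss of information
```

The sign of a principal component is arbitrary, so a subset can produce a mirrored scatter plot; the reflection step of the Procrustes fit removes that difference before M2 is computed.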

The strategy proposed by Krzanowski (1987) uses the PCA scores to obtain the configurations. However, since his M2 statistic ranges from zero to infinity, an unbounded scale, it is preferable to use the Procrustes M2 statistic provided by Peres-Neto and Jackson (2001), which is bounded. Accordingly, we established two criteria, the backward (Criterion 1) and exhaustive (Criterion 2) algorithms, for the selection of traits in the study of genetic diversity, and compared them with Jolliffe’s criterion (1972) to validate the methodology. It is assumed that there is a subset of traits that satisfactorily represents the structure of the original dataset with a minimal loss of information (represented by M2). That is, the residue produced by discarding some of the traits is minimal, and therefore the cluster pattern of the evaluated accessions is not significantly affected.

Criterion 1: The Backward Algorithm

The backward algorithm proposed by Krzanowski (1987), here with M2 given by Equation [3], has no stopping rule. The result is a sequence of (p-k) traits and their respective estimated M2 values, where k = 2 is the number of principal components chosen to graphically evaluate the genetic diversity. Moreover, the decision about which traits to retain in the study is arbitrary.

To select the subset of traits with this algorithm, Criterion 1 establishes a cut-off value for the M2 statistic, called the M2 critical. The resulting subset of traits, named the optimal selection, is the subset whose estimated M2 value is closest to (and less than or equal to) the M2 critical.
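A minimal sketch of Criterion 1 is given below, under stated assumptions: at each step the trait whose removal yields the lowest M2 against the full configuration is discarded, the procedure runs until k traits remain, and the optimal selection is read off the resulting path with an M2 critical cut-off. The function names and the synthetic data are hypothetical, and this is not the GENES implementation.

```python
import numpy as np


def pca_scores(X, k=2):
    """Scores on the first k principal components of the standardized data."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, s, _ = np.linalg.svd(Xs, full_matrices=False)
    return U[:, :k] * s[:k]


def procrustes_m2(Y, Z):
    """Bounded Procrustes statistic, M2 = 1 - (trace Sigma)^2."""
    Yc = Y - Y.mean(axis=0)
    Zc = Z - Z.mean(axis=0)
    Yc = Yc / np.linalg.norm(Yc)
    Zc = Zc / np.linalg.norm(Zc)
    sigma = np.linalg.svd(Zc.T @ Yc, compute_uv=False)
    return 1.0 - sigma.sum() ** 2


def backward_selection(X, k=2):
    """Return the path [(dropped trait, M2, traits kept), ...] of length p - k."""
    X = np.asarray(X, dtype=float)
    Y = pca_scores(X, k)                 # reference: all p traits
    kept = list(range(X.shape[1]))
    path = []
    while len(kept) > k:
        trials = []
        for j in kept:
            cols = [c for c in kept if c != j]
            trials.append((procrustes_m2(Y, pca_scores(X[:, cols], k)), j))
        m2, drop = min(trials)           # removal that loses the least information
        kept.remove(drop)
        path.append((drop, m2, tuple(kept)))
    return path


def optimal_selection(path, m2_critical=0.1):
    """Step whose M2 is closest to, without exceeding, the M2 critical."""
    ok = [step for step in path if step[1] <= m2_critical]
    return max(ok, key=lambda step: step[1]) if ok else None


# Synthetic stand-in data (NOT the Conilon coffee dataset).
rng = np.random.default_rng(2)
latent = rng.normal(size=(40, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.3 * rng.normal(size=(40, 6))

path = backward_selection(X)
for drop, m2, kept in path:
    print(f"dropped trait {drop}: M2 = {m2:.4f}, kept = {kept}")
print(optimal_selection(path, m2_critical=0.1))
```

Changing `m2_critical` moves the cut along the same path, which mirrors the flexibility the criterion is meant to give the researcher.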

Criterion 2: The Exhaustive Algorithm

Considering all combinations with k, k+1, ..., (p-1) traits, which total $C_p^k, C_p^{k+1}, \ldots, C_p^{p-1}$ subsets, respectively, it is possible to find the optimal solution using the same M2 critical. Thus, a total of $\sum_{i=k}^{p-1} C_p^i$ analyses were performed, one for each subset. This new procedure is referred to as the exhaustive algorithm, and it demands greater computational effort than Criterion 1.

With M2 critical = 0.1, this procedure provided a series of subsets of traits with M2 values lower than the M2 critical. The optimal selection was the subset whose estimated M2 was closest to (and less than or equal to) the M2 critical.
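The exhaustive search can be sketched with `itertools.combinations`. The sketch below first counts the subsets implied by the sum above for p = 16 and k = 2, then runs the search on a small synthetic matrix so that it finishes quickly. The function names and the data are assumptions made for illustration only.

```python
from itertools import combinations
from math import comb

import numpy as np


def n_subsets(p, k=2):
    """Number of subsets with k, k+1, ..., p-1 traits."""
    return sum(comb(p, i) for i in range(k, p))


def pca_scores(X, k=2):
    """Scores on the first k principal components of the standardized data."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, s, _ = np.linalg.svd(Xs, full_matrices=False)
    return U[:, :k] * s[:k]


def procrustes_m2(Y, Z):
    """Bounded Procrustes statistic, M2 = 1 - (trace Sigma)^2."""
    Yc = Y - Y.mean(axis=0)
    Zc = Z - Z.mean(axis=0)
    Yc = Yc / np.linalg.norm(Yc)
    Zc = Zc / np.linalg.norm(Zc)
    sigma = np.linalg.svd(Zc.T @ Yc, compute_uv=False)
    return 1.0 - sigma.sum() ** 2


def exhaustive_selection(X, k=2, m2_critical=0.1):
    """Evaluate every subset; return accepted subsets and the optimal one."""
    X = np.asarray(X, dtype=float)
    p = X.shape[1]
    Y = pca_scores(X, k)                     # reference: all p traits
    accepted = []
    for size in range(k, p):
        for cols in combinations(range(p), size):
            m2 = procrustes_m2(Y, pca_scores(X[:, list(cols)], k))
            if m2 <= m2_critical:
                accepted.append((m2, cols))
    # Optimal selection: accepted subset with M2 closest to the cut-off.
    best = max(accepted) if accepted else None
    return accepted, best


print(n_subsets(16, 2))  # 65518 subsets for 16 traits and k = 2

# Synthetic stand-in data (NOT the Conilon coffee dataset).
rng = np.random.default_rng(3)
latent = rng.normal(size=(40, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.3 * rng.normal(size=(40, 6))

accepted, best = exhaustive_selection(X, k=2, m2_critical=0.1)
print(len(accepted), "of", n_subsets(6, 2), "subsets accepted")
print(best)
```

The subset count grows roughly as 2^p, which is why this criterion demands more computation than the backward path of p - k steps.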

Jolliffe’s criterion: Traits discarded by the principal component’s technique

The trait subsets obtained by criteria 1 and 2 were compared with the subset obtained according to Jolliffe’s criterion (1972), which removes the traits with the greatest weights in the least important components (those with the smallest variance). Considering standardized traits for genetic diversity studies, Cruz et al. (2011) recommend that the number of traits to discard should be equal to the number of components with eigenvalues less than 0.7.

To prevent traits with greater variance from dominating the grouping result, the data are commonly standardized before PCA, since the principal components are obtained from the covariance matrix (Σ). Here, each value was standardized by subtracting the mean and dividing by the standard deviation of its respective variable. After standardization, the principal components are obtained from the covariance matrix of the standardized data, according to Mingoti (2005), which is the same as obtaining them from the correlation matrix (R) of the original dataset. Thus, we have Σ = R.
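The discard rule described above can be sketched as follows, under stated assumptions: components are visited from the smallest eigenvalue upward, one trait is discarded per component with eigenvalue below 0.7, and a trait already discarded is skipped in favour of the next largest absolute weight. The function name `jolliffe_discard` and the synthetic data are hypothetical. The sketch also checks numerically that the covariance matrix of the standardized data equals the correlation matrix of the original data.

```python
import numpy as np


def jolliffe_discard(X, threshold=0.7):
    """Return (discarded, kept) trait indices under the eigenvalue rule."""
    X = np.asarray(X, dtype=float)
    R = np.corrcoef(X, rowvar=False)
    # eigh returns eigenvalues in ascending order, so iteration starts
    # with the last (least important) principal component.
    eigvals, eigvecs = np.linalg.eigh(R)
    discarded = []
    for lam, vec in zip(eigvals, eigvecs.T):
        if lam >= threshold:
            break
        # Trait with the largest absolute weight not yet discarded.
        order = np.argsort(-np.abs(vec))
        j = next(int(i) for i in order if int(i) not in discarded)
        discarded.append(j)
    kept = [j for j in range(X.shape[1]) if j not in discarded]
    return discarded, kept


# Synthetic stand-in data (NOT the Conilon coffee dataset).
rng = np.random.default_rng(4)
latent = rng.normal(size=(40, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.3 * rng.normal(size=(40, 6))

# Covariance of standardized data equals the correlation matrix (Sigma = R).
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
same = np.allclose(np.cov(Xs, rowvar=False), np.corrcoef(X, rowvar=False))
print(same)

discarded, kept = jolliffe_discard(X)
print("discarded:", discarded)
print("kept:", kept)
```

Unlike criteria 1 and 2, this rule never compares the reduced configuration with the original one, so the M2 of the retained subset has to be computed separately if the two approaches are to be compared.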

All statistical analyses were performed in GENES software version 2016 (Cruz, 2013; 2016). GENES software is available at http://www.ufv.br/dbg/genes/genes.htm.

Results and discussion

With an M2 critical value of 0.1, subsets with six, eight, and seven Conilon coffee traits were obtained according to Criterion 1 (backward algorithm), Criterion 2 (exhaustive algorithm), and Jolliffe’s criterion, respectively (Table 1). The optimal selections from criteria 1 and 2 shared some traits, namely MG%, S15, and MS. The importance of these traits for the variability among Conilon coffee accessions is evident, since the subsets selected by both criteria increased the total variance explained by the first two principal components.

Table 1
Optimal selection using Procrustes and Jolliffe criteria.

The 2D graphical dispersion of accessions of Conilon coffee, considering all 16 traits (Figure 3a), represents the original data configuration and explains 49.35% of total variance. Although the 40 accessions could not be grouped in clusters, the graphical dispersion was considered useful in making inferences about Conilon coffee accessions diversity in this study. Figure 3a shows that accessions 13 and 8 are divergent, and according to their per se potential, they can be used in a cross to explore vigor and increase variability.

Figure 3
Dispersion of Conilon coffee accessions using two principal components (CP1 and CP2), according to: A - evaluation of 16 agronomic traits, B - Criterion 1 without Procrustes transformation, C - Criterion 1 with Procrustes transformation, and D - A and C superimposed (M2 = 0.089445).

Figure 3b shows the graphical dispersion of the accession scores on the first two components for the optimal selection resulting from Criterion 1. Note the change in the position of the accessions: relative to the original configuration in Figure 3a, they were reflected about the origin of component 2 (accessions with previously negative scores are now positive). To make the two configurations comparable, following the steps described in Material and methods and illustrated in Figure 2, the Procrustes analysis fitted the configuration of Figure 3b to that of Figure 3a such that the distance between them is minimal. After the Procrustes transformation of the optimal selection, it was possible to calculate its true difference from the original dataset, estimated as M2 = 0.0895.

It was verified that accession 8, which was previously divergent, as was accession 13, now belonged to the same group of genotypes as accession 14 (Figure 3c). Thus, the optimal selection with six traits did not satisfactorily represent the diversity pattern given by the dispersion of the original dataset (Figure 3a). The change in the clustering pattern of the accessions can be better visualized by superimposing graphs a and c (Figure 3d). It is worth pointing out that even if the optimal selection included the characteristics of interest, it was not adequate for evaluating the diversity of Conilon coffee accessions.

Criterion 1 provided a sequence of traits whose exclusion at each step of the backward algorithm gave the lowest estimated M2 value (Table 2). Note that the estimated M2 value increases as traits are discarded at each step of the algorithm. This was expected, as discarding variables increases the residue produced by the loss of information relative to the original dataset. Because the six-trait subset resulting from Criterion 1 did not satisfactorily represent the structure of accession diversity, the researcher can choose a different M2 critical value (higher or lower than the last one used). More or fewer traits are then considered in a re-evaluation of the clustering pattern among accessions, according to how much loss of information the researcher can tolerate. If the subsequent M2 critical values are inappropriate, the researcher can proceed with the study using all sixteen traits or evaluate different subsets given by Criterion 2.

Table 2
Variables excluded by the backward algorithm for the Conilon coffee.

Figure 4b shows that the accessions given by the optimal selection of Criterion 2 were reflected, as in Criterion 1, but now in relation to the origin of components 1 and 2, simultaneously. According to the dispersion of the accessions presented by the optimal selection resulting from Criterion 2, no change in the clustering pattern (Figure 4d) was observed. Therefore, the optimal selection (Figure 4c) provided a global dispersion satisfactorily close to the given dispersion of the original dataset (Figure 4a).

Figure 4
Dispersion of Conilon coffee accessions using two principal components (CP1 and CP2), according to: A - evaluation of 16 agronomic traits, B - Criterion 2 without Procrustes transformation, C - Criterion 2 with Procrustes transformation, and D - A and C superimposed (M2 = 0.1).

Criterion 2 provided a total of 9,841 combinations (subsets) with M2 values lower than M2 critical (Table 3), including the optimal selection resulting from Criterion 1. If the optimal selection of Criterion 2 does not satisfy the breeder's purposes, the diversity of other subsets with more or fewer traits can be evaluated.
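The exhaustive search behind Criterion 2 can be sketched in the same spirit. This is an illustration under stated assumptions rather than the authors' code: configurations are scores on the first k = 2 principal components of the correlation matrix, M2 is the symmetric Procrustes statistic scaled to [0, 1], and subsets of every size from k up to one less than the total number of traits are evaluated. The function names and the synthetic data are hypothetical, and a small number of traits is used so that the example runs quickly.

```python
from itertools import combinations
import numpy as np

def pca_scores(X, k=2):
    """Scores on the first k principal components of the correlation matrix."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:k].T

def procrustes_m2(A, B):
    """Symmetric Procrustes statistic in [0, 1] (assumed form of M2)."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    A = A / np.linalg.norm(A)
    B = B / np.linalg.norm(B)
    s = np.linalg.svd(A.T @ B, compute_uv=False)
    return float(max(0.0, 1.0 - s.sum() ** 2))

def exhaustive_selection(X, names, m2_crit=0.1, k=2):
    """Evaluate every subset of traits and keep those with M2 < m2_crit."""
    p = X.shape[1]
    reference = pca_scores(X, k)
    accepted, n_evaluated = [], 0
    for size in range(k, p):                       # subset sizes k .. p-1
        for cols in combinations(range(p), size):
            n_evaluated += 1
            m2 = procrustes_m2(reference, pca_scores(X[:, list(cols)], k))
            if m2 < m2_crit:
                accepted.append((m2, [names[j] for j in cols]))
    accepted.sort(key=lambda item: item[0])        # lowest M2 first
    return accepted, n_evaluated

# Synthetic example: 40 accessions, 7 traits driven by 2 latent factors.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2)) @ rng.normal(size=(2, 7)) + 0.3 * rng.normal(size=(40, 7))
names = [f"T{j + 1}" for j in range(7)]
accepted, n_evaluated = exhaustive_selection(X, names)
print(f"{len(accepted)} of {n_evaluated} subsets have M2 < 0.1")
for m2, subset in accepted[:5]:
    print(f"M2 = {m2:.4f}  {subset}")
```

Because every accepted subset is returned, a breeder could filter the list for subsets that contain particular traits of interest instead of accepting only the one with the lowest M2.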

Table 3
Total subsets determined by the exhaustive algorithm with M2 < M2 critical = 0.1.

From the data presented in Table 4, it is possible to identify the relative importance of the traits for the genetic diversity of the Conilon coffee accessions, and hence the order in which traits should be discarded. According to the criterion presented by Jolliffe (1972) and suggested by Cruz et al. (2011), from the last to the ninth principal component, the traits with the greatest weights were S13, MS, MG%, ChBe, S11, GY, PH, S15, and GM%. Accordingly, the optimal selection was given by the subset of traits D, DCA, ChCo, CoBe, CG%, FG%, and S17.
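The discard rule just described can be sketched as follows. This is an illustration under assumptions, not the exact procedure used in the study: the principal components are taken from the correlation matrix, components are inspected from the last one (smallest eigenvalue) upward, and in each component the trait with the largest absolute weight that has not already been discarded is removed. The number of components to inspect (`n_discard`) is supplied by the caller because the stopping rule is not restated in this passage. The function name `jolliffe_discard` and the synthetic data are hypothetical.

```python
import numpy as np

def jolliffe_discard(X, names, n_discard):
    """Discard one trait per component, starting from the last component."""
    R = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)       # eigenvalues in ascending order
    p = X.shape[1]
    gone = np.zeros(p, dtype=bool)
    discarded = []
    for i in range(n_discard):                 # i = 0 is the last component
        weights = np.where(gone, -1.0, np.abs(eigvecs[:, i]))
        j = int(np.argmax(weights))            # largest weight among traits still kept
        gone[j] = True
        discarded.append(names[j])
    kept = [names[j] for j in range(p) if not gone[j]]
    return kept, discarded

# Synthetic example: trait "B" is almost a copy of trait "A", so one of the
# two should be associated with the last component and discarded first.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=40)
names = ["A", "B", "C", "D", "E", "F"]
kept, discarded = jolliffe_discard(X, names, n_discard=3)
print("discarded (in order):", discarded)
print("kept:", kept)
```

Note that this rule returns a single subset and never compares the reduced configuration with the original one, which is the gap the Procrustes-based criteria are meant to fill.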

Table 4
Eigenvalue estimates from the correlation matrix, containing 16 traits and associated eigenvectors (components).

Figure 5b shows the dispersion of the accession scores on the first two principal components for the subset of seven traits established by Jolliffe’s criterion. As in the previous cases, excluding some traits changed the positions of the accessions, which were reflected about the origin of components 1 and 2. After the Procrustes transformation of the optimal selection, its actual difference from the original set was estimated as M2 = 0.3359. The magnitude of this estimate reflects the lack of proximity between corresponding coffee accessions in the two configurations (Figure 5d) and revealed a marked change in the clustering pattern of the accessions. This difference can be seen in accession 19, which was fitted to accession group 13 after the transformation, and in accessions 16, 17, and 35, which all belonged to accession group 8 after the transformation (Figure 5c).

Once the researcher knows which traits are of greater biological importance for the characteristic to be improved, prioritizing them can save time and financial resources, making breeding programs more sustainable. Thus, if the breeder is interested in a specific subset, its diversity can be evaluated graphically and its estimated M2 compared with that of the optimal selection from the exhaustive or backward algorithm, as a guide to how the magnitude of M2 affects the dispersion of the accessions.

Figure 5
Dispersion ranking of Conilon coffee accessions using two principal components (CP1 and CP2), according to A - evaluation of 16 agronomic traits, B - Jolliffe’s criterion without Procrustes transformation, C - Jolliffe’s criterion with Procrustes transformation, and D - A and C superimposed (M2 = 0.33585).

The results show that the optimal selection given by Criterion 1 provided the lowest estimated M2 value and the smallest number of traits. However, it did not adequately represent the diversity of Conilon coffee relative to the PCA of the full set of 16 traits (Figure 5a). In contrast, the subset selected by Criterion 2, despite having a greater number of traits, satisfactorily represented the diversity among the accessions. The subset selected by Jolliffe’s criterion gave a high estimated M2, more than three times M2 critical, revealing a change in the cluster pattern of the accessions and making this criterion relatively less efficient than the others.

In Procrustes-based selection, the number of solutions each criterion provides should also be taken into account. Criterion 1 and Jolliffe’s criterion each provided only one optimal selection, while Criterion 2 provided all subsets of traits with an estimated M2 below M2 critical (Table 3). This opens a range of possibilities for the researcher's decision-making, since the selection obtained from a given M2 critical and the backward algorithm may not include some variables that are biologically important to the genetic improvement of the crop. Additionally, the subset selected by Criterion 1 may not produce a graphical dispersion equivalent to that obtained from the original set.

Attention must also be paid to how the solutions are obtained. Unlike Criterion 1, which excludes one variable at each step of the backward algorithm, Criterion 2 uses the exhaustive algorithm, which evaluates all possibilities of discarding traits: one at a time, two at a time, and so on. The stepwise algorithm, which is useful for selecting traits in linear regression models, differs from the method used in Criterion 2 because it establishes the importance of traits by a different decision rule and excludes or includes traits iteratively.

The total number of analyses performed by the exhaustive algorithm depends on the number of traits studied (Table 5). As the number of traits increases, the number of analyses performed by Criterion 2 increases considerably. Thus, Criterion 2 becomes unattractive for high-order data matrices, whose handling involves high computational cost and whose processing may take months. The researcher must therefore weigh its use against the computational resources and time available, since there are currently no studies that establish the computational cost of this algorithm in relation to the number of traits.
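The growth in the number of analyses can be made concrete with a short count. The sketch below assumes that the exhaustive algorithm evaluates every subset containing at least k = 2 traits (the number of principal components retained) and at most one trait fewer than the full set; the study's own tally in Table 5 may use a different convention. The function name `n_subsets` is hypothetical.

```python
from math import comb

def n_subsets(p, k=2):
    """Number of trait subsets of size k .. p-1 drawn from p traits."""
    return sum(comb(p, size) for size in range(k, p))

# With k = 2 this equals 2**p - p - 2: all subsets minus the empty set,
# the p single-trait subsets, and the full set.
for p in (8, 12, 16, 20, 24):
    print(f"{p:2d} traits -> {n_subsets(p):>10,d} subsets")

print(n_subsets(16))  # 65518
```

Under this convention each additional trait roughly doubles the number of principal component and Procrustes analyses to be run, which is why the exhaustive search quickly becomes impractical as the data matrix grows.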

Table 5
Total number of analyses performed by the exhaustive algorithm on the Conilon coffee data.

Note that the M2 statistic used in this work ranges from 0 to 1, so the M2 critical value can be interpreted as the acceptable proportion of information lost by using the selected subset of traits. The researcher must bear in mind that even when a relatively small loss is specified, the resulting subset may or may not satisfactorily represent the genetic diversity of the original dataset. This occurs because the biological importance a breeder attaches to a variable may differ from its statistical importance. Therefore, the optimal selection must include all traits that are important to the breeder and represent the level of diversity in the original dataset.

Another relevant aspect of the criteria based on Procrustes analysis concerns the value M2 critical = 0.1 suggested in this study. In Criteria 1 and 2, the critical value can be relaxed slightly according to the number of traits the breeder wishes to discard. We suggest an interval from a minimum of 0.05 to a maximum of 0.15, as long as the clustering pattern of the accessions of the crop is maintained. These limits do not constitute a rule, since no other studies discard traits using these specific limits for genetic diversity; the researcher must therefore decide on them. It is worth remembering, however, that the Procrustes statistic adopted in this work ranges from 0 to 1 and that M2 critical was selected assuming that the residual produced by the loss of information in the resulting subset of traits would be at most 10%.

Based on the strategy proposed by Krzanowski (1987) and the backward algorithm of Krzanowski (1996), Munita, Barroso, and Oliveira (2013) obtained a subset of only eight traits that was sufficient to interpret the data on two axes (k = 2 principal components), which explained 76.6% of the total variation without substantial loss of information. The dataset comprised the concentrations of 13 chemical elements (traits) obtained by neutron activation in a set of 40 samples of ceramic fragments, whose first two components explained 79.9% of the total variation. Guedes and Ivanqui (1998) obtained similar results in a medical study with simulated data on 14 traits related to liver cancer, whose first two principal components explained 93.61% of the total variation. Using the backward algorithm without a stopping rule, Procrustes analysis established a subset of 8 traits with a configuration similar to that of the original set, represented on two axes that explained 93.66% of the data variation.

The results of this study also showed that even when the first two principal components explain only a small share of the total variation, a satisfactory representation of accession diversity on two axes can be obtained from the optimal selection given by Procrustes analysis. This confirms the contribution of the proposed criteria and of the technique presented for selecting traits in genetic diversity studies. Finally, the exhaustive procedure stands out for the number of optimal solutions it yields, which suggests enormous potential for genetic studies.

Procrustes analysis has wide applicability and offers interesting approaches. In plant breeding, there is currently no literature using Procrustes analysis to select phenotypic traits, which further highlights the relevance of this study for genetic improvement. Although Procrustes analysis has been little explored in plant breeding, García-Peña and Dias (2009) used it to compare uni- and multivariate analysis techniques under the AMMI methodology in a study of genotype by environment interaction. The joint use of Procrustes analysis and PCA has enormous potential, and its application in genetic improvement extends beyond the selection of variables to include the evaluation, through graphical dispersion, of the genetic diversity that is important for a breeding program.

This study provides breeders with a technique based on Procrustes analysis to assist decision-making on the exclusion of redundant traits. In practical terms, excluding a trait can reduce possible measurement errors as well as experiment time and costs, since the excluded trait may be expensive or difficult to measure. Technically, Procrustes analysis in diversity studies allows the grouping pattern of the accessions to be visualized after variables are discarded. The breeder can thus graphically evaluate the selected subset of traits, whether it was chosen by an automated selection method or by the breeders themselves. In addition, the M2 statistic quantifies the loss of information of a reduced subset of selected traits relative to the set of all traits.

Conclusion

The flexibility of trait selection according to researcher preference, graphical visualization, and the Procrustes M2 statistic through Criteria 1 and 2 provides a fast and reliable alternative for deciding which traits to phenotype in studies of Conilon coffee diversity, as well as in other crops. Procrustes analysis is advantageous for selecting traits and makes a relevant contribution to genetic diversity studies as an efficient alternative to Jolliffe’s criterion.

Acknowledgements

We thank FAPEMIG, CAPES, and CNPq for financial support and for granting scholarships for the development of the project.

References

  • Bonomo, P., Cruz, C. D., Viana, J. M. S., Pereira, A. A., Oliveira, V. R., & Carneiro, P. C. S. (2004). Seleção antecipada de progênies de café descendentes de “híbrido de timor” X “catuaí amarelo” e “catuaí vermelho”. Acta Scientiarum. Agronomy, 26(1), 91-96. DOI: 10.4025/actasciagron.v26i1.1969
    » https://doi.org/10.4025/actasciagron.v26i1.1969
  • Bramardi, S. J., Bernet, G. P., Asíns, M. J., & Carbonell, E. A. (2005). Simultaneous agronomic and molecular characterization of genotypes via the Generalised Procrustes Analysis. Crop Science, 45(4), 1603-1609. DOI: 10.2135/cropsci2004.0633
    » https://doi.org/10.2135/cropsci2004.0633
  • Cruz, C. D. (2013). Genes: a software package for analysis in experimental statistics and quantitative genetics. Acta Scientiarum. Agronomy, 35(3), 271-276. DOI: 10.4025/actasciagron.v35i3.21251
    » https://doi.org/10.4025/actasciagron.v35i3.21251
  • Cruz, C. D. (2016). Genes Software-extended and integrated with the R, Matlab and Selegen. Acta Scientiarum. Agronomy, 38(4), 547-552. DOI: 10.4025/actasciagron.v38i4.32629
    » https://doi.org/10.4025/actasciagron.v38i4.32629
  • Cruz, C. D., Ferreira, F. M., & Pessoni, L. A. (2011). Biometria aplicada ao estudo da diversidade genética. Visconde do Rio Branco, MG: Suprema.
  • Daboul, A., Ivanovska, T., Bülow, R., Biffar, R., & Cardini, A. (2018). Procrustes-based geometric morphometrics on MRI images: An example of inter-operator bias in 3D landmarks and its impact on big datasets. PLoS ONE, 13(5), e0197675. DOI: 10.1371/journal.pone.0197675
    » https://doi.org/10.1371/journal.pone.0197675
  • Douglas, T. S. (2004). Image processing for craniofacial landmark identification and measurement: a review of photogrammetry and cephalometry. Computerized Medical Imaging and Graphics, 28(7), 401-409. DOI: 10.1016/j.compmedimag.2004.06.002
    » https://doi.org/10.1016/j.compmedimag.2004.06.002
  • Ferrão, R. G., Cruz, C. D., Ferreira, A., Cecon, P. R., Ferrão, M. A. G., Fonseca, A. F. A. D., ... Silva, M. F. D. (2008). Parâmetros genéticos em café Conilon. Pesquisa Agropecuária Brasileira, 43(1), 61-69. DOI: 10.1590/S0100-204X2008000100009
    » https://doi.org/10.1590/S0100-204X2008000100009
  • García-Peña, M., & Dias, C. T. S. (2009). Análise dos modelos aditivos com interação multiplicativa (AMMI) bivariados. Revista Brasileira de Biometria, 27(4), 586-602.
  • Guedes, T. A., & Ivanqui, I. L. (1998). Análise procrustes aplicada à seleção de variáveis. Acta Scientiarum. Technology, 20, 505-509. DOI: 10.4025/actascitechnol.v20i0.3073
    » https://doi.org/10.4025/actascitechnol.v20i0.3073
  • Jolliffe, I. T. (1972). Discarding variables in a principal component analysis. I: Artificial data. Journal of the Royal Statistical Society: Series C (Applied Statistics), 21(2), 160-173. DOI: 10.2307/2346488
    » https://doi.org/10.2307/2346488
  • Klingenberg, C. P. (2003). Quantitative genetics of geometric shape: heritability and the pitfalls of the univariate approach. Evolution, 57(1), 191-195. DOI: 10.1111/j.0014-3820.2003.tb00230.x
    » https://doi.org/10.1111/j.0014-3820.2003.tb00230.x
  • Krzanowski, W. J. (1987). Selection of variables to preserve multivariate data structure, using principal components. Journal of the Royal Statistical Society: Series C (Applied Statistics), 36(1), 22-33. DOI: 10.2307/2347842
    » https://doi.org/10.2307/2347842
  • Krzanowski, W. J. (1996). A stopping rule for structure-preserving variable selection. Statistics and Computing, 6(1), 51-56. DOI: 10.1007/BF00161573
    » https://doi.org/10.1007/BF00161573
  • Liu, S., Zheng, X., Yu, L., Feng, L., Wang, J., Gong, T., ... Xu, R. (2017). Comparison of the genetic structure between in situ and ex situ populations of Dongxiang wild rice (Oryza rufipogon Griff.). Crop Science, 57(6), 3075-3084. DOI: 10.2135/cropsci2017.01.0015
    » https://doi.org/10.2135/cropsci2017.01.0015
  • Mauricio, A. A., Palazzo, A. B., Caselato, V. M., & Bolini, H. M. A. (2016). Generalized procrustes analysis and external preference map used to consumer drivers of diet gluten free product. Food and Nutrition Sciences, 7(9), 711-723. DOI: 10.4236/fns.2016.79072
    » https://doi.org/10.4236/fns.2016.79072
  • Mingoti, S. A. (2005). Análise de dados através de métodos de estatística multivariada: uma abordagem aplicada. Belo Horizonte, MG: Editora UFMG.
  • Muleta, K. T., Bulli, P., Zhang, Z., Chen, X., & Pumphrey, M. (2017). Unlocking diversity in germplasm collections via genomic selection: A case study based on quantitative adult plant resistance to stripe rust in spring wheat. The Plant Genome, 10(3), 1-15. DOI: 10.3835/plantgenome2016.12.0124
    » https://doi.org/10.3835/plantgenome2016.12.0124
  • Munita, C. S., Barroso, L. P., & Oliveira, P. M. (2013). Variable selection study using Procrustes analysis. Open Journal of Archaeometry, 1(e7), 31-35. DOI: 10.4081/arc.2013.e7
    » https://doi.org/10.4081/arc.2013.e7
  • Peres-Neto, P. R., & Jackson, D. A. (2001). How well do multivariate data sets match? The advantages of a Procrustean superimposition approach over the Mantel test. Oecologia, 129(2), 169-178. DOI: 10.1007/s004420100720
    » https://doi.org/10.1007/s004420100720
  • Oliveira, A. P. V., & Toledo Benassi, M. de (2010). Avaliação sensorial de pudins de chocolate com açúcar e dietéticos por perfil livre. Ciência e Agrotecnologia, 34(1), 146-154. DOI: 10.1590/S1413-70542010000100019
    » https://doi.org/10.1590/S1413-70542010000100019
  • Rodrigues, W. N., Brinate, S. V., Martins, L. D., Colodetti, T. V., & Tomaz, M. A. (2017). Genetic variability and expression of agro-morphological traits among genotypes of Coffea arabica being promoted by supplementary irrigation. Genetics and Molecular Research, 16(2). DOI: 10.4238/gmr16029563
    » https://doi.org/10.4238/gmr16029563
  • Singh, D. (1981). The relative importance of characters affecting genetic divergence. Indian Journal of Genetics and Plant Breeding, 41(2), 237-245.
  • Yano, R., Nonaka, S., & Ezura, H. (2018). Melonet-DB, a grand RNA-Seq gene expression atlas in melon (Cucumis melo L.). Plant and Cell Physiology, 59(1), e4-e4. DOI: 10.1093/pcp/pcx193
    » https://doi.org/10.1093/pcp/pcx193
  • Yousaf, M. I., Hussain, K., Hussain, S., Ghani, A., Arshad, M., Mumtaz, A., & Hameed, R. A. (2018). Characterization of indigenous and exotic maize hybrids for grain yield and quality traits under heat stress. International journal of Agriculture and Biology, 20(2), 333-337. DOI: 10.17957/IJAB/15.0493
    » https://doi.org/10.17957/IJAB/15.0493

Publication Dates

  • Publication in this collection
    03 July 2020
  • Date of issue
    2020

History

  • Received
    07 June 2018
  • Accepted
    21 Nov 2018
Editora da Universidade Estadual de Maringá - EDUEM Av. Colombo, 5790, bloco 40, 87020-900 - Maringá PR/ Brasil, Tel.: (55 44) 3011-4253, Fax: (55 44) 3011-1392 - Maringá - PR - Brazil
E-mail: actaagron@uem.br