Selection of the most informative morphoagronomic descriptors for cassava germplasm

The objective of this work was to select the most informative morphoagronomic descriptors for cassava (Manihot esculenta) germplasm and to evaluate the ability of different methods to select the descriptors. Ninety-five accessions were characterized using 51 morphoagronomic descriptors. Data were subjected to a multiple correspondence analysis (MCA), whose information was used in the following four methods of descriptor selection: reverse order of the descriptor for the pth factorial axis of the MCA (Jolliffe); sequential multiple correspondence analysis (SMCA); mean of the contribution orders of the descriptor in the first three factorial axes (C3PA); and C3PA method weighted by the respective eigenvalues of the full analysis (C3PAWeig). The correlations between the dissimilarity matrix with all descriptors and the most informative descriptors were high and significant (0.75, 0.77, 0.83, and 0.84 for C3PAWeig, C3PA, SMCA, and Jolliffe, respectively). The less informative descriptors were discarded, considering those common among the selection methods and relevant for the breeding interests. Therefore, 32 morphoagronomic descriptors with correlation between the dissimilarity matrices (r=0.81) were selected, due to their high capacity to discriminate cassava germplasm and to their ability to maintain some preliminary agronomic traits, useful for the initial characterization of the germplasm.


Introduction
Most of the produced cassava (Manihot esculenta Crantz) is destined for food consumption (human and animal) and for industrial purposes. Cassava starch is a worldwide, multibillion-dollar business, especially due to its many industrial applications (Tonukari, 2004).
The wide range of cassava uses should be linked to the genetic diversity of the species, which guides the development of varieties with traits that meet the diverse requirements of the consumer market as well as of the crop production system. Therefore, the maintenance of cassava genetic diversity ensures the conservation of useful alleles related to resistance to pests and diseases, better root quality, and differential starch characteristics (Raji et al., 2007; Oliveira et al., 2014). The availability of wide genetic variability is essential for cassava breeding programs that are aimed at developing new varieties adapted to cultivation in different environments and to industrial requirements.
The organization and maintenance of major germplasm banks and the efforts toward the collection, characterization, evaluation, and use of wild species and landraces are of immeasurable importance in ensuring the sustainability of the cassava production chain, particularly in countries with continental dimensions, such as Brazil. The Brazilian cassava germplasm collections have been characterized primarily based on the analysis of morphological descriptors and of cyanogenic compounds in the roots, for the classification of genotypes as "sweet" or "bitter" cassava (Fukuda & Alves, 1987).
Some studies have been conducted for the Brazilian cassava germplasm using morphoagronomic descriptors and molecular markers, in order to evaluate the genetic diversity and associations between different traits (Vieira et al., 2011; Mezette et al., 2014). However, most of these studies used only part of the descriptors currently available for cassava and a small number of germplasm accessions, mainly due to the difficulty of using all descriptors in the entire collection. To reduce the work of characterization activities, several studies have been conducted to define a list of the most informative descriptors to distinguish accessions in gene banks (Strapasson et al., 2000; Castro et al., 2012; Oliveira et al., 2012; Silva et al., 2013). Reliable methods and procedures for germplasm characterization are essential to increase the use of available genetic variability in breeding programs (Oliveira et al., 2012).
In cassava, 75 descriptors have been used for germplasm characterization, of which 54 are morphological and 21 are agronomic traits (Fukuda & Guevara, 1998). Recently, Fukuda et al. (2010) published a revised version of their work, focusing on the documentation and characterization of cassava germplasm to analyze data and to draw comparisons among different countries. However, the selection of these descriptors was not based on their discrimination power, as identified by appropriate statistical tools. Furthermore, the number of descriptors remains high for use in cassava germplasm characterization, requiring a large number of observations, which is extremely time-consuming and costly.
Although the definition of the descriptors of greater importance has been made based on the experience of researchers (Coffelt & Johnson, 2011), the use of multivariate analysis techniques has been more effective in identifying descriptors of major interest and in indicating the disposal of the less relevant ones (Strapasson et al., 2000; Giraldo et al., 2010; Castro et al., 2012; Oliveira et al., 2012; Silva et al., 2013). Besides, multivariate analyses, specifically the multiple correspondence analysis (MCA), have the advantage of assessing the importance of each studied descriptor in the total available variation among accessions, enabling the discard of the less discriminating descriptors, which are invariant or correlated with other descriptors, similarly to the principal component analysis.
The objective of this work was to select the most informative descriptors for cassava (Manihot esculenta Crantz) germplasm evaluation and to evaluate the ability of different methods to select the descriptors.

Materials and Methods
Ninety-five germplasm accessions belonging to the Cassava Germplasm Bank (CGB) from Embrapa Mandioca e Fruticultura (Cruz das Almas, Brazil) were characterized from 2011 to 2013. The choice of the germplasm accessions was based on their high phenotypic contrasts observed under field conditions. This database consists of landraces and improved varieties which resulted from conventional breeding procedures, such as crossing and selection, as well as the selection of landraces with high yield potential, as identified by farmers or research institutions.
The accessions were planted at the beginning of the rainy season (May-July 2011-2013) using stem cuttings of 15-20 cm in single rows. Spacing was 0.90 m between rows and 0.80 m within rows, and the cultivation system was performed according to recommendations for cassava cultivation (Souza et al., 2006). Harvesting was done between 11 and 12 months after planting in the three years of evaluation.
Previously characterized with 51 descriptors according to Fukuda et al. (2010), the 95 accessions were used to define the list of most informative cassava descriptors (Table 1). The descriptors were divided into four categories: 13 minimum, 13 principal, 10 secondary and 15 preliminary agronomic descriptors (Fukuda et al., 2010). As most traits are qualitative, the quantitative traits were also categorized so that they could be included in the same analysis. In this case, each quantitative trait was distributed into seven classes, and this categorization was based on the range of data variation.
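As an illustration of the categorization step, the following minimal sketch distributes a quantitative trait into seven classes. It assumes equal-width classes spanning the observed range, which is one plausible reading of "based on the range of data variation" and is not stated explicitly in the text; the function name categorize_by_range and the example values are hypothetical.

```python
import numpy as np

def categorize_by_range(values, n_classes=7):
    """Assign each value to one of n_classes equal-width classes (1..n_classes).

    Assumption: classes have equal width between the observed minimum and maximum.
    """
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        # An invariant trait falls entirely into a single class.
        return np.ones(values.shape, dtype=int)
    # Interior cut points only; values below the first cut go to class 1,
    # values at or above the last cut go to class n_classes.
    cuts = np.linspace(lo, hi, n_classes + 1)[1:-1]
    return np.digitize(values, cuts) + 1

# Hypothetical measurements of one quantitative trait
classes = categorize_by_range([0.0, 7.0, 3.4, 6.9])
```

A value lying exactly on a cut point is assigned to the upper class; other conventions are equally defensible.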
The multiple correspondence analysis (MCA) (Escofier & Pagès, 1992) was used to select the minimum descriptors similarly to the principal component analysis (PCA). The selection of the most informative descriptors was based on four criteria (the Jolliffe method, the sequential multiple correspondence analysis, the selection by the mean contribution orders, and the selection by the weighted average contribution orders), which are described as follows.
The Jolliffe method uses the reverse order of the descriptor for the pth factorial axis (O'p) (Jolliffe, 1973). In this case, the descriptor with the highest contribution to the last factorial axis can be discarded, considering that the importance of the principal components or factorial axes decreases from the first to the last, and that this descriptor therefore explains little of the total variance. In other words, the descriptor with the highest coefficient on the axis with the lowest eigenvalue (the last axis) can be discarded.
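The reverse ordering used by the Jolliffe criterion can be sketched as follows. This is a minimal illustration, assuming that O'p = 1 is given to the descriptor with the smallest contribution to the last axis and O'p = p to the one with the largest (the first candidate for disposal). The first three contributions are those reported in this work for PetLen, ShPl and Mite; the fourth value and its label are hypothetical.

```python
import numpy as np

def reverse_order_last_axis(last_axis_contrib):
    """Return O'p for each descriptor.

    1 = smallest contribution to the last factorial axis (most worth keeping),
    p = largest contribution (first candidate for disposal).
    """
    c = np.asarray(last_axis_contrib, dtype=float)
    order = np.empty(len(c), dtype=int)
    order[np.argsort(c, kind="stable")] = np.arange(1, len(c) + 1)
    return order

# Contributions (%) to the last axis: PetLen, ShPl, Mite, and a hypothetical "Other"
names = ["PetLen", "ShPl", "Mite", "Other"]
o_p = reverse_order_last_axis([9.87, 8.54, 7.52, 1.00])
# o_p -> [4, 3, 2, 1]: PetLen would be the first descriptor proposed for disposal
```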
In the sequential MCA (SMCA), the descriptor with the highest contribution to the last factorial axis is discarded, and further analyses are carried out with the remaining descriptors, until their ordering is established according to their importance (Silva et al., 2013).
The selection by the mean contribution orders (C3PA) is based on the mean of the contribution orders of the descriptor on the first three factorial axes of the full analysis along with O'p, i.e., OS = (O1 + O2 + O3 + O'p)/4 (Silva et al., 2013). The selection by the weighted average contribution orders (C3PAWeig) follows the C3PA criterion, with weights defined by the respective eigenvalues of the full analysis.
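The C3PA score OS = (O1 + O2 + O3 + O'p)/4 can be computed from a table of descriptor contributions as in the sketch below. It assumes that O1, O2 and O3 are ranks in which 1 denotes the largest contribution to the respective axis, and that O'p is the reverse rank on the last axis (1 denotes the smallest contribution), so that a lower OS indicates a more informative descriptor. The function names and the toy contributions are hypothetical.

```python
import numpy as np

def rank_order(x, largest_first):
    """Rank the values of x from 1 to len(x); ties are broken by position."""
    x = np.asarray(x, dtype=float)
    key = -x if largest_first else x
    order = np.empty(len(x), dtype=int)
    order[np.argsort(key, kind="stable")] = np.arange(1, len(x) + 1)
    return order

def c3pa_score(contrib_first3, contrib_last):
    """OS = (O1 + O2 + O3 + O'p) / 4 for each descriptor (one row per descriptor)."""
    contrib_first3 = np.asarray(contrib_first3, dtype=float)
    o123 = np.column_stack(
        [rank_order(contrib_first3[:, k], largest_first=True) for k in range(3)]
    )
    o_p = rank_order(contrib_last, largest_first=False)
    return (o123.sum(axis=1) + o_p) / 4.0

# Toy contributions (%) of three hypothetical descriptors to axes 1-3 and to the last axis
first3 = [[50, 40, 10],
          [30, 50, 30],
          [20, 10, 60]]
last = [5, 80, 15]
os_score = c3pa_score(first3, last)
# os_score -> [1.75, 2.0, 2.25]
```

The weighted variant (C3PAWeig) is not sketched here because the exact weighting formula is not given in the text.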
In the present work, all these analyses were performed using the FactoMineR package (Lê et al., 2008) for R version 3.0.1.
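For readers who wish to see how descriptor contributions to each factorial axis arise, the following sketch computes an MCA from the indicator (complete disjunctive) matrix with NumPy. It is a hand-written illustration of the standard correspondence analysis computation, not the FactoMineR implementation used in this work; the function name mca_descriptor_contributions and the toy data are hypothetical, and no missing values are assumed.

```python
import numpy as np

def mca_descriptor_contributions(data):
    """MCA of a table of categorical descriptors (rows = accessions).

    Returns (eigenvalues, contrib), where contrib[q, k] is the contribution (%)
    of descriptor q to factorial axis k.
    """
    data = np.asarray(data)
    n_desc = data.shape[1]
    blocks, owner = [], []
    for j in range(n_desc):
        cats = np.unique(data[:, j])
        # One indicator column per category observed for descriptor j
        blocks.append((data[:, [j]] == cats[None, :]).astype(float))
        owner.extend([j] * len(cats))
    Z = np.hstack(blocks)
    owner = np.array(owner)

    P = Z / Z.sum()                       # correspondence matrix
    r, c = P.sum(axis=1), P.sum(axis=0)   # row and column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals
    _, sv, Vt = np.linalg.svd(S, full_matrices=False)

    keep = sv > 1e-10                     # drop numerically null axes
    eigenvalues = sv[keep] ** 2
    # The contribution of a category to an axis is its squared right singular
    # vector element; a descriptor's contribution sums over its categories.
    cat_contrib = Vt[keep].T ** 2
    contrib = np.vstack(
        [cat_contrib[owner == j].sum(axis=0) for j in range(n_desc)]
    ) * 100.0
    return eigenvalues, contrib

# Toy data: six hypothetical accessions scored for three hypothetical descriptors
toy = np.array([["a", "x", "p"], ["a", "y", "q"], ["b", "x", "q"],
                ["b", "y", "p"], ["a", "x", "q"], ["b", "y", "q"]])
eigenvalues, contrib = mca_descriptor_contributions(toy)
```

The last column of contrib gives the contributions to the last factorial axis, and the first three columns those to the first three axes.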
To compose the list of the most informative descriptors, a maximum of 30 descriptors was initially adopted. The efficiency of the four selection criteria in establishing the minimum list of descriptors was evaluated by the correlation between the dissimilarity matrix estimated with all descriptors and the one calculated using only the most informative descriptors. The dissimilarity matrices (simple matching) for the multicategorical variables were calculated as dij = D/(C + D), in which: i and j correspond to a pair of accessions (i, j = 1, 2, ..., n); C is the number of concordant classes; and D is the number of disagreeing classes.
Table 1. List of descriptors found in the 95 accessions (K2), used for the germplasm characterization of Manihot esculenta, with the respective number of defined categories in the descriptor manual (K1) (Fukuda et al., 2010).
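The simple matching dissimilarity dij = D/(C + D) can be sketched as below, under the assumption that every accession has a class recorded for every descriptor (no missing data), so that C + D equals the number of descriptors. The function name and the toy scores are hypothetical.

```python
import numpy as np

def simple_matching_dissimilarity(data):
    """Pairwise dij = D / (C + D) for a table of class codes (rows = accessions)."""
    data = np.asarray(data)
    n = data.shape[0]
    d = np.zeros((n, n))
    for i in range(n):
        # Proportion of descriptors whose class differs between accession i
        # and every other accession: D / (C + D)
        d[i] = (data != data[i]).mean(axis=1)
    return d

# Three hypothetical accessions scored for four descriptors
scores = [[1, 2, 3, 4],
          [1, 2, 5, 6],
          [1, 2, 3, 4]]
dm = simple_matching_dissimilarity(scores)
# dm[0, 1] -> 0.5 (two of four descriptors disagree); dm[0, 2] -> 0.0
```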
Correlation between the dissimilarity matrices of all descriptors versus the most informative ones, according to the different selection criteria, was used as a quality criterion for selecting descriptors. The correlation significance was assessed by the t-test and by the Mantel test with ten thousand simulations, using the Genes software (Cruz, 2006). The list of the most informative cassava descriptors was not defined by a single criterion for discarding descriptors, but by a comparative analysis of those recommended for disposal by most of the analyzed selection criteria.
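A permutation-based Mantel test of the correlation between two dissimilarity matrices can be sketched as follows. This is a generic one-sided version written for illustration; it is not the routine implemented in the Genes software, whose details are not described here. The function name, the default seed, and the toy matrices are hypothetical.

```python
import numpy as np

def mantel_test(d1, d2, n_perm=10000, seed=0):
    """Pearson correlation between two dissimilarity matrices and its
    one-sided permutation p-value (rows and columns of d2 are permuted jointly)."""
    d1 = np.asarray(d1, dtype=float)
    d2 = np.asarray(d2, dtype=float)
    n = d1.shape[0]
    upper = np.triu_indices(n, k=1)       # each pair of accessions counted once
    x = d1[upper]
    r_obs = np.corrcoef(x, d2[upper])[0, 1]

    rng = np.random.default_rng(seed)
    n_ge = 0
    for _ in range(n_perm):
        p = rng.permutation(n)
        r_perm = np.corrcoef(x, d2[np.ix_(p, p)][upper])[0, 1]
        if r_perm >= r_obs:
            n_ge += 1
    # The observed arrangement is counted as one of the permutations
    return r_obs, (n_ge + 1) / (n_perm + 1)

# Toy example: distances among eight random points, compared with a noisy copy
rng = np.random.default_rng(1)
pts = rng.normal(size=(8, 2))
full = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)
noisy_pts = pts + rng.normal(scale=0.05, size=pts.shape)
reduced = np.linalg.norm(noisy_pts[:, None] - noisy_pts[None, :], axis=2)
r, p_value = mantel_test(full, reduced, n_perm=999)
```

In the context of this work, d1 would hold the dissimilarities from all descriptors and d2 those from the reduced list.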

Results and Discussion
Initially, the descriptors growth habit of stem (GroHabSt), prominence of foliar scars (ProFoSca) and length of stipules (LenSti) did not show any variation in the set of evaluated cassava accessions; therefore, they were eliminated from further analysis.
Descriptor contributions for the first three factorial axes in the complete MCA showed that approximately 62% of the descriptors with the largest contributions accounted for more than 90% of the total variability that was associated with each axis (Table 2). This result can be explained by the existence of redundancy or association between the descriptors, allowing some of them to capture the same information in the germplasm variability.
Petiole length (PetLen) was the descriptor with the largest contribution to the last factorial axis (9.87%) and, in principle, has less importance for the characterization of these accessions (Table 2). In theory, this result means that the information associated with this variable is already covered by other descriptors which contribute to other axes. However, PetLen also showed a high contribution to the first three factorial axes (12th, 10th and 1st largest contribution, respectively), and is therefore not suitable for disposal under this criterion. In contrast, other descriptors which showed a large contribution to the last factorial axis also showed a low contribution to the first axes. For instance, the shape of plant (ShPl) had the second highest contribution to the last factorial axis in the MCA (8.54%), and its contributions to the first three factorial axes were lower than those of PetLen (38th, 15th and 25th, respectively); therefore, disposal based on the higher contribution to the last axis would be correct. Additionally, the descriptor tolerance to mites (Mite) had the third largest contribution to the last factorial axis (7.52%), and also had a low contribution to the first three axes (34th, 17th and 35th largest contribution, respectively). Even with these inconsistencies, the informative descriptors were selected for comparison with other disposal methods. This same behavior was observed in Capsicum spp., when selecting minimum descriptors for germplasm characterization, in which the authors decided to consider different strategies, instead of discarding descriptors based on only one criterion (Silva et al., 2013).
The 30 descriptors that were selected by the SMCA criterion were used to obtain the dissimilarity matrix, which was compared with that obtained for all descriptors. Overall, there was a significant correlation (p<0.01) between the dissimilarity matrix with all of the descriptors and that with the descriptors selected by the SMCA criterion (0.83) (Figure 1).
Considering descriptor contributions for the first three factorial axes of MCA (C3PA), some descriptors were less informative in capturing most of the genetic variation data, as follows: the color of stem cortex (ColStCor), color of apical leaves (ColApLea), peduncle position (PedPos), length of phyllotaxy (LenFil), petiole color (PetCol), flowering (Flo), resistance to anthracnose (Ant), number of leaf lobes (NLeaLo), vigor (Vig), leaf retention (LeaRet), sinuosity of the leaf lobe (SinLea), periderm: ease of peeling (PerEasPel), extent of root peduncle (ExtRooPe), leaf color (LeaCol), cortex: ease of peeling (CorEasPel), stipule margin (StiMarg), color of end branches of adult plant (ColBraAPl) and root position (RooPos) (Table 2). By withdrawing these descriptors from the analyses, the correlations between the dissimilarity matrices were significant, but the absolute value of the correlation (0.77) was lower than that obtained by the SMCA criterion (Figure 1), indicating a poorer fit between the reduced and the full list of descriptors. The use of different methodologies leads to some inconsistencies which should be carefully analyzed. In the study of Oliveira et al. (2012), quantitative descriptors were subjected to principal component analyses, by which eighteen and fifteen quantitative descriptors were discarded by the Singh and direct selection methods, respectively. However, considering the simultaneous analyses of these methodologies, only 60% of the descriptors were selected to maximize the total variation of the genotypes. Therefore, simultaneous analyses using several methods seem to be an efficient strategy to minimize errors in the elimination of descriptors.
Table 2. The 30 most informative morphoagronomic descriptors of cassava ranked by: successive multiple correspondence analysis (SMCA), contribution to the first three factorial axes (C3PA), weighted criterion of contribution to the first three factorial axes (C3PAWeig), and the Jolliffe criterion (Jolliffe).
(2) C1, contributions (%) to the last factorial axes in each analysis, from which the variable with the largest contribution to the last factorial axis of the previous analysis (SMCA) was eliminated in each cycle; Os, contributions (%) of the descriptor to the first three factorial axes of the full analysis (C3PA); Oz, contributions (%) of the descriptor to the first three axes, with weights defined by the respective eigenvalues of the full analysis (C3PAWeig); O'p, contribution (%) of the descriptor to the last factorial axis (Jolliffe). Underlined numbers refer to the selected descriptors for each criterion.
The criterion for descriptor selection based on the weighted mean of the contribution orders of the descriptor for the first three axes, with weights defined by the respective eigenvalues of the full analysis (C3PAWeig), indicated that the PerEasPel, shape of plant (ShPl), dry matter content (DMC), PetCol, angle of branching (AngBra), Mite, color of root pulp (ColRooPu), ColApLea, Vig, PedPos, ExtRooPe, NLeaLo, LeaCol, CorEasPel, StiMarg, ColBraAPl, LeaRet and RooPos descriptors were less informative for cassava discrimination (Table 2). Although there was a significant correlation between the dissimilarity matrix of the complete data and that of the descriptors selected by the C3PAWeig criterion (0.75), the absolute value of the correlation was still lower than that of the C3PA criterion (Figure 1). Similar results were observed in Capsicum spp., in which the thirty best descriptors showed significant correlations (p<0.01) among the dissimilarity matrices, and the correlation magnitude of C3PAWeig was lower (0.87) than that of C3PA (0.89) (Silva et al., 2013).
The criterion based on the highest eigenvector elements in the last components (Jolliffe) indicated that LeaRet, root diameter (RooDiam), AngBra, height to first branching (HeiFiBra), plant height (PlHei), color of stem exterior (ColStEx), yield of noncommercial roots (YiNComRoo), width of leaf lobe (WidLeaLo), resistance to bacterial blight (Bac), yield of commercial roots (YiComRoo), shape of central lobe (ShaCeLob), pubescence on apical leaves (PubApLea), ShPl, Mite, ColRooPu, number of cutting stems per plant (NCutStPl), root length (RooLen) and PetLen (Table 2) should be discarded. Based on the Jolliffe criterion, there was a higher correlation between the dissimilarity matrix of the full data and that of the informative descriptors (0.84), indicating a better fit in the descriptor disposal (Figure 1). In contrast, Silva et al. (2013) decided to consider the participation of the descriptors in the first three factorial axes (C3PA and C3PAWeig), instead of discarding variables based only on Jolliffe (1973), since this criterion showed that the descriptor "species" (Esp) was the largest contributor to the last factorial axis, despite showing the second highest contribution to the first factorial axis and the greatest contribution to the second and third axes.
In general, there is a mismatch between the descriptors listed for disposal by the four selection methods. Only the PetLen, RooLen, NCutStPl and YiComRoo descriptors were common to all the disposal methods (Table 2). Therefore, the decision about which descriptors should effectively be discarded was based on the coincidence of at least two methods. In this case, 23 descriptors could be dropped from the analysis: PetLen, RooLen, NCutStPl, YiComRoo, HeiFiBra, PlHei, length of leaf lobe (LenLeaLo), color of stem epidermis (ColStEp), ColStEx, ShaCeLob, WidLeaLo, levels of branching (LevBra), Mite, Bac, color of root cortex (ColRooCo), external color of storage root (ExColRoo), RooDiam, shoot weight (ShoWe), PubApLea, ratio of length/width of leaf lobe (RaLenWidLea), YiNComRoo, texture of root epidermis (TexRooEp) and ShPl. However, considering that the YiComRoo, PlHei, Mite, Bac, ShoWe, YiNComRoo and ShPl descriptors are directly related to the production capacity and resistance to biotic stresses, being routinely assessed in cassava breeding programs, and that the number of accessions in the present work is only a sample of the germplasm stored in the CGB (approximately 8%), the exclusion of these descriptors is not indicated. In passion fruit, Castro et al. (2012) selected the quantitative descriptors based on the direct selection and Singh methods, whereas the qualitative descriptors were analyzed by correlation. These authors found that direct selection using principal component analysis pointed out eight characters to be discarded versus seven using the Singh method. Therefore, to reduce inconsistencies in the elimination of descriptors, they adopted both procedures to indicate the most relevant descriptors.
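The decision rule described above (discard a descriptor when at least two methods agree, unless it is retained for breeding purposes) can be sketched as follows. The descriptor labels and list memberships below are placeholders for illustration only and do not reproduce the lists in Table 2; the function name consensus_discards is hypothetical.

```python
from collections import Counter

def consensus_discards(discard_lists, min_methods=2, protected=()):
    """Descriptors proposed for disposal by at least min_methods methods,
    excluding those protected because of their relevance to breeding."""
    counts = Counter(
        desc for descs in discard_lists.values() for desc in set(descs)
    )
    protected = set(protected)
    return sorted(
        desc for desc, k in counts.items()
        if k >= min_methods and desc not in protected
    )

# Placeholder disposal lists for the four criteria (not the real data)
proposed = {
    "SMCA":     ["D1", "D2", "D3"],
    "C3PA":     ["D1", "D4"],
    "C3PAWeig": ["D1", "D2", "D5"],
    "Jolliffe": ["D1", "D3", "D6"],
}
to_discard = consensus_discards(proposed, min_methods=2, protected={"D3"})
# to_discard -> ["D1", "D2"]
```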
Criteria for discarding descriptors showed very different directions; therefore, it is necessary to make comparisons between the methods, to critically analyze the common descriptors between the methodologies and, possibly, to place more value on the contribution of the variables on the first factorial axes.
Indications for descriptor disposal that varied with the variable selection method were also observed by Silva et al. (2013) for Capsicum spp. However, these authors made no comparison of the descriptors common to the different methods. Therefore, their descriptor choice was based on the variable selection method showing the highest correlation between the dissimilarity matrices of all descriptors versus the most informative list of descriptors.
The list of morphoagronomic descriptors associated with a higher discrimination capacity of cassava germplasm and with agronomic information important to cassava breeders consisted of 32 descriptors (Table 3). The correlation between the dissimilarity matrices of the full and the minimum list of descriptors was significant and of high magnitude (0.81) (Figure 2). Although the absolute value of this correlation is smaller than those obtained with the SMCA method and with the method proposed by Jolliffe (1973), this list, which includes preliminary agronomic descriptors, meets curator interests in the germplasm classification and breeder concerns in the preliminary characterization of germplasm for use in breeding programs. Considering that, in many cases, germplasm characterization for several species is linked to their use in breeding programs, it is important that some descriptors allow a preliminary analysis of the agronomic potential of accessions, in order to use them to generate superior segregating populations. Thus, as reported for passion fruit (Castro et al., 2012) and papaya (Oliveira et al., 2012), descriptors related to commercial traits of interest should be kept in the analysis.
The analysis of the cassava collection further highlights the high proportion of informative descriptors related to root traits (preliminary agronomic descriptors). This result is interesting because these traits are used to define the potential use of cassava germplasm, either per se or as parents in crossing blocks in breeding programs.
Therefore, it is possible simultaneously to characterize these germplasm collections for the representation of genetic variability and to satisfy the immediate interests of breeders regarding the availability of accessions ready for practical use. Similarly, many of the standard Capsicum descriptors are also related to fruit traits in sweet and hot peppers, which have been the focus of the breeding of these species because of their relation with storage, processing, marketing, and the consumption of commercial derivatives (Silva et al., 2013). In this sense, if cassava germplasm characterization considers the traits of interest for breeding, the possibilities for the use of the germplasm are higher, as stipulated in the public policies of conservation and use of plant genetic resources (Food and Agriculture Organization of the United Nations, 1991).
The list of the most informative morphoagronomic descriptors for cassava could reduce the number of descriptors which are currently in use in germplasm banks, without harming the representation of cassava genetic variability and the interests of breeding programs. In general, the list of the standard descriptors proposed in the present study (Table 3) indicates that 37% of the initial descriptors (19 out of 51) were discarded. This descriptor reduction is lower than those reported in the literature, such as for: Paspalum sp., 86% (Strapasson et al., 2000); fig (Ficus carica L.), 74% (Giraldo et al., 2010); and Capsicum sp., 50% (Silva et al., 2013).

Conclusions
1. The weak relationship between the different methods of descriptor selection makes it difficult to identify the descriptors with the highest discrimination ability using only one method.
2. It is possible to discard 37% of the studied descriptors in cassava germplasm characterization.
3. For cassava characterization, the following descriptors can be discarded: pubescence on apical leaves, shape of central lobe, color of stem exterior, external color of storage root, color of root cortex, texture of root epidermis, length of leaf central lobe, width of leaf central lobe, ratio of length/width of leaf lobe, petiole length, color of stem epidermis, growth habit of stem, height to first branching, levels of branching, prominence of foliar scars, length of stipules, number of cutting stems per plant, root length, and root diameter.
4. The 32 selected cassava descriptors account for more than 90% of the total variability without significant losses for the preliminary characterization of the cassava germplasm.

Figure 1. Correlation between the dissimilarity matrices (DM), comparing all the descriptors with those selected by successive multiple correspondence analysis (SMCA), contribution to the first three factorial axes (C3PA), weighted contribution to the first three factorial axes (C3PAWeig), and Jolliffe criteria (Jolliffe). **Significant at 1% of probability both by t and by Mantel tests.

Figure 2. Correlation between the dissimilarity matrices (DM), comparing the descriptors and the final list of the most informative selected ones. **Significant at 1% of probability both by t and by Mantel tests.