DECISION TREE AS A TOOL IN THE CLASSIFICATION OF LIMA BEAN ACCESSIONS

ABSTRACT Morpho-agronomic characterization studies aimed at discriminating and classifying lima bean accessions according to centers of domestication and biological status have been of great importance for conserving the biodiversity of this species. For this purpose, researchers have widely used the multivariate technique known as discriminant analysis, which does not always produce satisfactory results. Computational intelligence-based classifiers are additional tools for understanding complex classification problems. The objective of this study was to test the use of the decision tree in the classification of lima bean according to the centers of domestication and biological status (cultivated and wild), based on eight phenotypic traits of the seed. Sixty accessions of lima bean from the Phaseolus Germplasm Bank of Universidade Federal do Piauí (BGP/UFPI) were evaluated, and classification was performed using two approaches: conventional statistics, with discriminant analysis of principal components (DAPC), and computational intelligence, with the decision tree (DT). The results showed that the DT was efficient in identifying patterns in the classification of lima bean accessions, due to its comprehensibility. Seed weight was one of the main descriptors explaining the origin and diversity of the species. These results will be useful for studies involving the conservation of genetic resources, mainly for the maintenance of germplasm banks and in breeding programs. In addition, it is recommended to integrate machine learning algorithms into studies aimed at classifying lima bean.


INTRODUCTION
The species Phaseolus lunatus L., popularly referred to as lima beans, can be found in the form of two botanical varieties: P. lunatus var. lunatus, which includes domesticated populations, and P. lunatus var. silvester, made up of wild populations (BAUDET, 1977). Classification of this species is carried out according to its geographical origin and seed characteristics.
Studies conducted by Motta-Aldana et al. (2010), based on the chloroplast DNA of 109 lima bean accessions, pointed to two major domestication centers: 1) Andean (A), located in Ecuador and northern Peru, in which the plants have large seeds and are adapted to high altitudes; and 2) Mesoamerican (M), which extends from central-western Mexico to Honduras, where plants have small seeds and occur at lower altitudes.
Discrimination of populations has been of great importance for biodiversity conservation and the development of breeding programs. Thus, analyses of genetic diversity, through phenotypic or genetic characteristics, have guided the choice of appropriate parents in breeding programs, leading to the optimization of selective gains, due to the variability found in the offspring of crosses between divergent groups (SANT'ANNA et al., 2018).
One of the techniques that allow allocating a new individual to one of the previously known distinct populations is the multivariate analysis called discriminant analysis (NOGUEIRA et al., 2008). However, biometric analyses are not always able to produce satisfactory results, mainly because the populations analyzed are not sufficiently divergent or the quantity and quality of the variables used in the study are not adequate.
In this context, conducting analyses through methods based on computational intelligence, which are able to go through stages of learning and generalization from all information, being tolerant to noise, represents a major advance for studies involving statistical procedures and for genetic improvement (SANT'ANNA et al., 2018).
Artificial intelligence has allowed a new approach in the decision-making process in several areas of agriculture (LIAKOS et al., 2018), with great potential in the conservation of genetic resources.
Among the techniques of artificial intelligence, the decision tree (DT) stands out because it is widely used in various fields, mainly in machine learning, to solve complex classification problems.
Its success is largely explained by its ability to provide simple representations, which are easily understood by experts and even by ordinary users (TRABELSI; ELOUEDI; LEFEVRE, 2019).
Several studies show that seed-related traits are among the main contributors to the understanding of the genetic diversity and origin of the species (CHACÓN-SÁNCHEZ; MARTÍNEZ-CASTILLO, 2017; SILVA et al., 2017). Considering that DTs are attractive for practical applications due to their comprehensibility and that studies using machine learning with the species P. lunatus are still incipient, the aim of this study was to apply the decision tree in the classification of lima beans according to the centers of domestication and biological status (cultivated and wild) of the species, by means of phenotypic traits of the seed.

MATERIAL AND METHODS
The 60 lima bean accessions evaluated came from the Phaseolus Germplasm Bank of Universidade Federal do Piauí (BGP/UFPI). Eight quantitative descriptors were measured, as recommended by the International Plant Genetic Resources Institute (IPGRI, 2001). The descriptors, with their abbreviations and units in parentheses, were: 1) seed length (SL, mm); 2) seed width (SW, mm); 3) seed weight (100SW, g); 4) seed thickness (ST, mm); 5) seed area (SAR, mm²); 6) seed length perimeter (SLP, mm); 7) seed length to width ratio (L/W); 8) seed thickness to width ratio (T/W). The measurements were taken with a digital caliper and with the Smartgrain software (TANABATA et al., 2012; ASSUNÇÃO-NETO et al., 2018). Classification of accessions was performed based on DAPC and on the probability of association, using the a priori information on the country of origin and biological status of the accession (Table 1). For the application of the decision tree (DT) algorithm, the response variable was the classification presented in Table 1 and the predictor variables were the measured phenotypic traits of the seed (Figure 1). The splitting index used was Gini (BREIMAN et al., 1984).
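The Gini index mentioned above can be illustrated with a minimal Python sketch. It is not the software used in this study: the function names (gini_impurity, weighted_gini) and the toy values are hypothetical and serve only to show how a candidate split of the form "trait >= threshold" is scored, with lower weighted impurity indicating a purer split.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def weighted_gini(values, labels, threshold):
    """Weighted impurity of the two children created by `value >= threshold`."""
    upper = [lab for v, lab in zip(values, labels) if v >= threshold]
    lower = [lab for v, lab in zip(values, labels) if v < threshold]
    n = len(labels)
    return (len(upper) / n) * gini_impurity(upper) + (len(lower) / n) * gini_impurity(lower)


# Hypothetical toy data (NOT measurements from this study): seed area in mm².
toy_area = [90.0, 82.0, 60.0, 55.0]
toy_status = ["cultivated", "cultivated", "wild", "wild"]

# A threshold that separates the two classes perfectly has weighted impurity 0.
perfect_split = weighted_gini(toy_area, toy_status, 75.0)
# A poorer threshold leaves mixed classes in one child and scores higher.
poor_split = weighted_gini(toy_area, toy_status, 85.0)
```

A tree-growing algorithm of the kind described by Breiman et al. (1984) would evaluate many such thresholds for each trait and keep the split with the lowest weighted impurity.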

RESULTS AND DISCUSSION
The principal component analysis, performed on the means of the eight quantitative traits evaluated, provided the eigenvalues (variance) of each principal component, the percentage of explained variance and the accumulated proportion (Table 2). The first two principal components (PCs) were sufficient to explain 91.34% of the total variation contained in the set of phenotypic traits of the seed.
The discriminant analysis of principal components (DAPC) seeks to obtain functions that make it possible to classify an accession, from a set of measured characteristics, into one of several known groups, seeking to minimize the probability of misclassification (CRUZ; REGAZZI; CARNEIRO, 2014).
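The relationship between eigenvalues, percentage of explained variance and accumulated proportion can be sketched as follows. This is an illustration under stated assumptions: the eigenvalues below are hypothetical and are not the values of Table 2, and the function names are introduced here only for the example.

```python
from itertools import accumulate


def explained_variance(eigenvalues):
    """Return (proportions, cumulative proportions) for a list of eigenvalues."""
    total = sum(eigenvalues)
    proportions = [ev / total for ev in eigenvalues]
    cumulative = list(accumulate(proportions))
    return proportions, cumulative


def components_needed(eigenvalues, target):
    """Smallest number of leading components whose cumulative proportion >= target."""
    _, cumulative = explained_variance(eigenvalues)
    for k, c in enumerate(cumulative, start=1):
        if c >= target:
            return k
    return len(eigenvalues)


# Hypothetical eigenvalues, sorted in decreasing order (NOT the values of Table 2).
toy_eigenvalues = [6.0, 1.0, 0.5, 0.5]
proportions, cumulative = explained_variance(toy_eigenvalues)
n_pcs = components_needed(toy_eigenvalues, 0.85)
```

With these toy values, the first component alone accounts for 75% of the variance and the first two for 87.5%, so two components are enough to pass an 85% target.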
The scatter plot (Figure 2), which corresponds to the grouping based on the discriminant analysis of principal components (DAPC) and was plotted from the first two principal components, used the information on the country of origin and biological status of the accessions (Table 1). Each color represents one of the groups defined a priori and therefore makes it possible to verify the consistency of this grouping in relation to DAPC. Graphic analysis (Figure 2) showed the dispersion of lima bean accessions and their separation according to biological status (cultivated and wild). There was also considerable overlap between the wild groups (WA and WM), which indicates that these groups are close to each other and difficult to tell apart.
Rev. Caatinga, Mossoró, v. 34, n. 2, p. 471-478, abr.-jun., 2021
In the graph of the association probability (Figure 3), that is, the probability of an accession belonging to the group to which it was assigned based on country of origin and biological status (Table 1), Group I, cultivated Andean, had high values for the probability of association, except for the accessions UFPI 1124, UFPI 1095 and UFPI 1105, which had low values for the traits related to the seed, indicating characteristics associated with the Mesoamerican domestication center. Some possible explanations for these exceptions within each group are the natural hybridization of wild and domesticated forms, which is observed throughout Latin America and generates intermediate forms (BAUDOIN et al., 2004), or the phenotypic plasticity of the crop.
Group III, cultivated Mesoamerican, also showed a high probability of association with values higher than 90%, except for the accession UFPI 1102. The contrast of this accession in group III is mainly due to the high values for the traits related to the seed, except for thickness. This causes the accession to be highly associated with group I, cultivated Andean, since larger and heavy seeds identify the Andean domestication center (MOTTA-ALDANA et al., 2010).
Groups II and IV showed considerable mixture because both are of wild biological status, with very similar phenotypic values for the seed traits, which made differentiation difficult. Some possible explanations for why the efficiency of discrimination was not so high are: I) the groups are not in fact so differentiated and consequently there will be overlap; II) the quantity and quality of the variables are not sufficient to differentiate the groups; III) the statistical approach used for discriminating the accessions is not adequate.
The decision tree (DT) (Figure 4) was used as a tool for better dissection of the classification of lima bean accessions regarding domestication centers and biological status of the crop, based on phenotypic traits of the seed. According to the DT, a seed with an area larger than or equal to 75 mm² is classified as being of cultivated biological status; otherwise it is classified as wild. When a seed of cultivated biological status has 100SW higher than or equal to 56 g, it is classified in the cultivated Andean (CA) group and, if 100SW is lower, it is classified as cultivated Mesoamerican (CM). When the seed has an area smaller than 75 mm² and a thickness greater than or equal to 3.2 mm, it is classified as WA and, if the thickness is smaller, as WM. According to the decision tree, the most important variable for the classification of a new lima bean accession with respect to the domestication center and biological status was seed weight (100SW) (Figure 5). Seeds of cultivated biological status with 100SW greater than or equal to 56 g are classified in the group of cultivated Andean seeds, while lighter seeds are related to the Mesoamerican domestication center (Figure 4).
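The rules described above can be written as a short function. This is a minimal sketch that only restates the thresholds reported in the text (75 mm² for seed area, 56 g for 100SW and 3.2 mm for seed thickness). The function and parameter names are hypothetical, and the example inputs are illustrative values, not accessions from this study.

```python
def classify_accession(seed_area_mm2, seed_weight_100_g, seed_thickness_mm):
    """Apply the decision rules reported in the text to one accession.

    Returns one of "CA", "CM", "WA" or "WM".
    """
    if seed_area_mm2 >= 75.0:
        # Cultivated branch: 100-seed weight separates Andean from Mesoamerican.
        return "CA" if seed_weight_100_g >= 56.0 else "CM"
    # Wild branch: seed thickness separates WA from WM.
    return "WA" if seed_thickness_mm >= 3.2 else "WM"


# Illustrative inputs (hypothetical values, not measured accessions).
examples = [
    classify_accession(80.0, 60.0, 5.0),  # large and heavy seed
    classify_accession(80.0, 40.0, 5.0),  # large but lighter seed
    classify_accession(60.0, 20.0, 3.5),  # small and thicker seed
    classify_accession(60.0, 20.0, 3.0),  # small and thinner seed
]
```

Only three of the eight traits appear in these rules, which is what makes the tree easy to read and to apply to a new accession.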
The traits seed length to width ratio (L/W) and seed thickness to width ratio (T/W) contributed little and can be discarded in new studies on the classification of this species (Figure 5). The results obtained corroborate the literature, demonstrating that 100SW is one of the main descriptors used to explain the origin and diversity of lima beans (GUTIÉRREZ-SALGADO; GEPTS; DEBOUCK, 1995; MOTTA-ALDANA et al., 2010; CHACÓN-SÁNCHEZ; MARTÍNEZ-CASTILLO, 2017). These studies found that larger seeds (weight ranging from 58 to 122 g per 100 seeds) are classified in the Andean domestication center.
Figure 5. Importance of the phenotypic variables of the seed used in the decision tree (DT) for the classification of a new lima bean accession. 100SW: seed weight; SAR: seed area; SLP: seed length perimeter; SL: seed length; SW: seed width; ST: seed thickness; L/W: seed length to width ratio.
The study conducted by Silva et al. (2017), which characterized 166 lima bean accessions cultivated in Brazil using the Ward-MLM (Modified Location Model) procedure in order to analyze the organization of genetic diversity and the origin of this species, detected high genetic variability. The traits seed length and width were the main contributors to the genetic divergence between the evaluated accessions. The traits SL and SW were also important in the construction of the DT. The confusion matrix was obtained to evaluate the DT, and it indicated an accuracy of 85%.
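As an illustration of how accuracy is derived from a confusion matrix, the sketch below computes overall accuracy and per-group recall. The matrix is a hypothetical toy example for the four groups (CA, CM, WA, WM), not the confusion matrix obtained in this study, and the function names are assumptions made for the example.

```python
def accuracy(confusion):
    """Overall accuracy: correctly classified (diagonal) over all classified items."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


def per_class_recall(confusion):
    """Fraction of each true group (row) that was assigned to the right group."""
    recalls = []
    for i, row in enumerate(confusion):
        row_total = sum(row)
        recalls.append(row[i] / row_total if row_total else 0.0)
    return recalls


# Hypothetical toy matrix (NOT the study's results).
# Rows are true groups and columns are predicted groups, in the order CA, CM, WA, WM.
toy_confusion = [
    [5, 1, 0, 0],
    [0, 4, 0, 0],
    [0, 0, 3, 2],
    [0, 0, 1, 4],
]
toy_accuracy = accuracy(toy_confusion)
toy_recall = per_class_recall(toy_confusion)
```

In this toy matrix all errors among the wild groups are confusions between WA and WM, the kind of pattern that per-group recall exposes and that overall accuracy alone would hide.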
Comparing the use of discriminant analysis of principal components (DAPC) and the decision tree (DT) in the lima bean classification study shows that the DT made it easier to understand the logic behind the differentiation between the accessions. This methodology was also efficient in classifying wild lima bean accessions. According to Piltaver et al. (2016), the decision tree is one of the most understandable machine learning techniques for solving complex classification problems. This approach has considerable potential for studies involving the conservation of genetic resources, mainly for the maintenance of germplasm banks of the species and in breeding programs of the crop. The explanation of the differentiation between groups is a subject of great relevance in the conservation of genetic resources, in the identification of duplicates, and in genetic improvement, when the objective is to establish heterotic groups and identify hybrid combinations of higher vigor.

CONCLUSIONS
Artificial intelligence-based methods are important alternatives in studies aimed at discrimination of populations.
The use of the decision tree in the classification of lima beans according to the domestication centers and biological status (cultivated and wild) of the crop, based on phenotypic traits of the seed, proved to be efficient.
Seed weight is a key character that can be used to clarify the origin and diversity of lima beans, considering that seeds classified as of cultivated biological status and having 100SW greater than or equal to 56 g belong to the Andean domestication center, while lighter seeds are related to the Mesoamerican center.