Computational intelligence for studies on genetic diversity between genotypes of biomass sorghum

The objective of this work was to evaluate the potential of computational intelligence and canonical variables for studies on the genetic diversity between biomass sorghum (Sorghum bicolor) genotypes. The experiments were carried out in the experimental field of Embrapa Milho e Sorgo, in the municipalities of Nova Porteirinha and Sete Lagoas, in the state of Minas Gerais, Brazil. The following traits were evaluated: days to flowering, plant height, fresh biomass yield, total dry biomass, and dry biomass yield. The study of genetic diversity was performed through the analysis of canonical variables. For the recognition of the organization pattern of genetic diversity, Kohonen’s self-organizing map was used. Canonical variables and the self-organizing map were both efficient for the study of genetic diversity. The application of computational intelligence using a self-organizing map is promising and efficient for studies on the genetic diversity between biomass sorghum genotypes.

For the generation of sorghum hybrids, three types of lines (called A, B, and R) are required (House, 1985). The A and B lines are isogenic and differ only in cytoplasm: the A line has cytoplasm that confers the male-sterile phenotype when associated with recessive nuclear alleles for fertility restoration, whereas the B line has normal cytoplasm and is male-fertile even with recessive nuclear alleles for fertility restoration (Smith & Frederiksen, 2000). Thus, the hybrid is obtained by crossing the A line (female) with an R line (restorer) that has dominant alleles for the fertility restoration gene.
Traditional genetic diversity studies, in which a group of different accessions is evaluated, are carried out to identify contrasting parents of higher potential that will form the base population or hybrids (Cruz et al., 2011). However, the study of genetic diversity also allows other inferences, such as the recognition of inheritance patterns, especially in situations in which the cluster represents the proximity of hybrid combinations and their parents. In this situation, the clustering pattern can reflect the similarity of parents, as well as important genetic phenomena, such as dominance and epistasis, that produce similar phenotypes from different genotypic combinations (Ferrão et al., 2002).
To determine how far one population or genotype is from another, biometric methods based on the quantification of heterosis, or on predictive processes of heterosis, are used. Predictive methods are usually quantified by measurements of dissimilarity (Cruz et al., 2012). Several methods may be used, including principal component analysis, canonical variables, and agglomerative methods (Guidoti et al., 2018). The choice of method depends on the goal of the study, the precision desired by the breeder, the ease of analysis, and the way the data were obtained (Cruz et al., 2012).
The computational intelligence approach to genetic improvement can be used to assist breeder decision-making (Crain et al., 2018; Montesinos-López et al., 2018; Harfouche et al., 2019). From a database of well-conducted experiments that characterize accessions in different environmental conditions, genotypes can be selected more efficiently by artificial neural networks (ANN), particularly to recognize the organization of this diversity in analyses based on the self-organizing map (SOM) (Barbosa et al., 2011; Santos et al., 2019; Wolski & Kruk, 2020). SOMs are a type of ANN developed by Kohonen (1982) that are trained by unsupervised learning to project high-dimensional, complex data onto a two-dimensional grid (Kohonen, 1982, 2014; Augustijn & Zurita-Milla, 2013). In addition, SOMs have a nonlinear structure with which they can capture more complex data characteristics, either quantitative or qualitative, which is not always possible using traditional statistical techniques (Galvão et al., 1999; Santos et al., 2019).
Preliminary studies are performed using population information (in the present study, on parents and hybrids) that is known beforehand, in the network training stage. Then, the efficiency of network discrimination with the adjusted structure is verified for recognizing the organizational pattern, which allows inferences on the phenotypic similarity of distinct genotypes.
The objective of this work was to evaluate the potential of computational intelligence and canonical variables for studies on the genetic diversity between biomass sorghum genotypes.

Materials and Methods
The experiments were carried out in the experimental field of Embrapa Milho e Sorgo, in the state of Minas Gerais (MG), Brazil, in the municipalities of Nova Porteirinha, in the north of the state (15°48'S; 43°18'W), and Sete Lagoas (19°27'S; 44°15'W). Four experiments were conducted in different crop years and locations (Table 1), as also described by Silva et al. (2018).
In the 2016/2017 crop year, 30 hybrids, the parent lines, and six non-mutant controls for the bmr gene (201636B005, 201636B006, 201636B008, 201636B019, 201636B004, and 'BRS 716') were evaluated in a triple lattice (7x7) design, at the experimental unit of Embrapa Milho e Sorgo, in Nova Porteirinha, MG. The sowing occurred on October 27, 2016. In the same crop year, the 30 hybrids, the parent lines, and the same six non-mutant controls for the bmr gene were also evaluated in a triple lattice (7x7) design, at the experimental unit of Embrapa Milho e Sorgo, in Sete Lagoas, MG. The sowing occurred on November 21, 2016.
In the 2017/2018 crop year, 40 hybrids, the parent lines, and two non-mutant controls for the bmr gene ('BRS 716' and N52K1009) were evaluated in an alpha-lattice (7x8) design, at Embrapa Milho e Sorgo, in the municipality of Nova Porteirinha, MG. The sowing occurred on October 28, 2017. In the same crop year, the 40 hybrids, the parent lines, and the same two non-mutant controls were evaluated in an alpha-lattice (7x8) design, at Embrapa Milho e Sorgo, in the municipality of Sete Lagoas, MG. The sowing occurred on October 26, 2017.
The plots of the four experiments consisted of double 3 m rows, spaced at 0.70 m. Plants were thinned two weeks after seedling emergence, maintaining a population of approximately 110,000 plants per ha. Fertilization was performed with 450 kg ha⁻¹ of the formulation 08-28-16 (N-P₂O₅-K₂O) applied in the planting furrows, plus 200 kg ha⁻¹ urea as topdressing at 20-25 days after sowing. Weed control was performed by the application of atrazine and manual weeding. The other cultural practices related to pest and disease control were carried out following the recommendations for the crop in each region (Coelho, 2015). Plants were harvested when they reached the physiological maturity of the grain.
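The plot and fertilizer figures above can be turned into per-plot quantities with simple arithmetic. The following is only an illustrative sketch: it assumes that the plot area equals the number of rows times the row length times the row spacing, which the text does not state explicitly, and all variable names are hypothetical.

```python
# Values taken from the text above.
ROWS_PER_PLOT = 2          # double rows
ROW_LENGTH_M = 3.0         # m
ROW_SPACING_M = 0.70       # m
PLANTS_PER_HA = 110_000    # approximate target population
FERTILIZER_KG_HA = 450     # kg/ha of the 08-28-16 formulation
FORMULA_PCT = {"N": 8, "P2O5": 28, "K2O": 16}

# Assumption: plot area = rows x row length x row spacing.
plot_area_m2 = ROWS_PER_PLOT * ROW_LENGTH_M * ROW_SPACING_M

# Expected number of plants per plot at the target population (1 ha = 10,000 m2).
plants_per_plot = PLANTS_PER_HA * plot_area_m2 / 10_000

# Nutrient rates supplied at planting, in kg/ha.
nutrients = {name: FERTILIZER_KG_HA * pct / 100 for name, pct in FORMULA_PCT.items()}

print(f"plot area: {plot_area_m2:.1f} m2, plants per plot: {plants_per_plot:.0f}")
print(nutrients)
```

Under this assumption, each plot covers about 4.2 m² and holds about 46 plants, and the planting fertilization supplies 36, 126, and 72 kg ha⁻¹ of N, P₂O₅, and K₂O, respectively.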
The following traits were evaluated: days to flowering (FLOW, number of days), which consists of the days between sowing and pollen release by 50% of the plants in the plot; plant height (PH, m), which is the mean height of the plants within the plot, measured from the soil surface to the top of the panicle; fresh biomass yield (FBY, Mg ha⁻¹), which was determined by weighing all plants of the useful area; and total dry biomass (DB, %), which is the quantification of the dry matter content, by weighing samples of each treatment and determining the green weight (GW). Samples were then stored
Table 1. Biomass sorghum (Sorghum bicolor) parents and hybrids used in the partial diallel in experiments I, II, III and IV(1).
The joint analysis of variance was performed to evaluate the traits, in Nova Porteirinha and Sete Lagoas, in the 2016/2017 and 2017/2018 crop years, for the common genotypes, considering the effects of treatment and environment as fixed, according to the following equation: $y_{ijk} = \mu + g_i + e_j + ge_{ij} + b/e_{jk} + \varepsilon_{ijk}$, in which: $y_{ijk}$ is the observed phenotypic value of the $i$th genotype, in the $k$th block, within the $j$th environment; $\mu$ is the overall mean; $g_i$ is the effect of the $i$th genotype; $e_j$ is the effect of the $j$th environment; $ge_{ij}$ is the effect of the interaction of the $i$th genotype with the $j$th environment; $b/e_{jk}$ is the effect of the $k$th block within the $j$th environment; and $\varepsilon_{ijk}$ is the effect of the experimental error.
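To make the terms of this model concrete, the sketch below partitions the total sum of squares into the genotype, environment, interaction, block-within-environment, and error components. It is a simplified illustration, not the analysis run in Genes: it assumes balanced data with complete blocks in every environment (the lattice structure of the actual experiments is ignored), and the function name and toy values are hypothetical.

```python
from itertools import product


def joint_anova_ss(y):
    """Sums of squares for y[i][j][k]: genotype i, environment j, block k.

    Assumes balanced data with complete blocks inside each environment.
    """
    n_g, n_e, n_b = len(y), len(y[0]), len(y[0][0])
    cells = list(product(range(n_g), range(n_e), range(n_b)))

    def mean(values):
        values = list(values)
        return sum(values) / len(values)

    grand = mean(y[i][j][k] for i, j, k in cells)
    g = [mean(y[i][j][k] for j in range(n_e) for k in range(n_b)) for i in range(n_g)]
    e = [mean(y[i][j][k] for i in range(n_g) for k in range(n_b)) for j in range(n_e)]
    ge = [[mean(y[i][j]) for j in range(n_e)] for i in range(n_g)]
    b = [[mean(y[i][j][k] for i in range(n_g)) for k in range(n_b)] for j in range(n_e)]

    return {
        "genotype": n_e * n_b * sum((g[i] - grand) ** 2 for i in range(n_g)),
        "environment": n_g * n_b * sum((e[j] - grand) ** 2 for j in range(n_e)),
        "g_x_e": n_b * sum((ge[i][j] - g[i] - e[j] + grand) ** 2
                           for i in range(n_g) for j in range(n_e)),
        "block_in_env": n_g * sum((b[j][k] - e[j]) ** 2
                                  for j in range(n_e) for k in range(n_b)),
        "error": sum((y[i][j][k] - ge[i][j] - b[j][k] + e[j]) ** 2 for i, j, k in cells),
        "total": sum((y[i][j][k] - grand) ** 2 for i, j, k in cells),
    }


# Hypothetical toy data: 2 genotypes x 2 environments x 2 blocks.
toy = [[[10.0, 12.0], [14.0, 15.0]],
       [[11.0, 15.0], [13.0, 18.0]]]
ss = joint_anova_ss(toy)
for source, value in ss.items():
    print(f"{source:>13}: {value:.3f}")
```

For balanced data, the five components add up to the total sum of squares, which gives a quick check that the partition matches the terms of the equation.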
The study of genetic diversity was performed by graphic dispersion through canonical variables analysis, in which the scores are plotted in graphs whose axes are the first two canonical variables. After determining the number of canonical variables that accumulate at least 80% of the total available variance, the scores for each canonical variable are estimated and plotted on two- or three-dimensional graphs, in which the canonical variables serve as reference axes and the graphical distances represent the similarities and dissimilarities between genotypes (Piassi et al., 1995).
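The 80% criterion can be illustrated with a short sketch that counts how many canonical variables are needed to reach a given share of the total variance. The eigenvalues used here are hypothetical and are not taken from the experiments; the function name is also an assumption made for illustration.

```python
def n_canonical_variables(eigenvalues, threshold=80.0):
    """Return (count, cumulative %) of leading canonical variables that
    together reach at least `threshold` percent of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for count, value in enumerate(ordered, start=1):
        cumulative += 100.0 * value / total
        if cumulative >= threshold:
            return count, cumulative
    return len(ordered), cumulative


# Hypothetical eigenvalues associated with four canonical variables.
count, explained = n_canonical_variables([6.0, 3.0, 0.7, 0.3])
print(count, round(explained, 2))
```

With these made-up eigenvalues, the first two canonical variables account for 90% of the total variance, so a two-dimensional dispersion graph would be considered adequate under the criterion.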
To recognize the organization of diversity, Kohonen's self-organizing maps (SOMs) were used. Different network architectures were tested by varying the number of rows (1 to 5) and columns (1 to 5). Kohonen (2001) points out that determining the number of neurons and the learning parameters is an empirical process based on the researcher's experience and on trial and error. Therefore, to select the best network architecture, 2,000 training runs were performed for each combination. The defined topology was hextop, that is, with a hexagonal neighborhood, and the distance used to configure the artificial neural networks was the Euclidean distance. The software Matlab (Matlab, 2012) and Genes (Cruz, 2013) were used to perform the analyses.
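A minimal sketch of how a self-organizing map with a hexagonal grid and Euclidean distance can be trained is given next. It is not the Matlab implementation used in the study: the Gaussian neighborhood, the linear decay of the learning rate and radius, the number of epochs, and the toy data are all assumptions chosen for illustration, and the criterion used to pick the best architecture is not reproduced because the text does not state it.

```python
import math
import random


def hex_positions(rows, cols):
    """Neuron coordinates on a hexagonal grid (odd rows shifted by half a cell)."""
    return [(c + 0.5 * (r % 2), r * math.sqrt(3) / 2)
            for r in range(rows) for c in range(cols)]


def best_matching_unit(x, weights):
    """Index of the neuron whose weight vector is closest to x (Euclidean)."""
    return min(range(len(weights)), key=lambda n: math.dist(x, weights[n]))


def train_som(data, rows=5, cols=5, epochs=200, lr0=0.5, seed=0):
    """Train a SOM; returns one weight vector per neuron (row-major order)."""
    rng = random.Random(seed)
    pos = hex_positions(rows, cols)
    dim = len(data[0])
    # Random initial synaptic weights, so different seeds give different maps.
    weights = [[rng.random() for _ in range(dim)] for _ in pos]
    radius0 = max(rows, cols) / 2
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs
        lr = lr0 * decay                      # assumed linear decay
        radius = max(radius0 * decay, 0.5)    # assumed shrinking neighborhood
        for x in rng.sample(data, len(data)):
            bmu = best_matching_unit(x, weights)
            for n, w in enumerate(weights):
                grid_dist = math.dist(pos[bmu], pos[n])
                h = math.exp(-grid_dist ** 2 / (2 * radius ** 2))
                for t in range(dim):
                    w[t] += lr * h * (x[t] - w[t])
    return weights


# Hypothetical standardized trait values for six genotypes in two groups.
toy = [[0.10, 0.12], [0.08, 0.15], [0.12, 0.09],
       [0.90, 0.88], [0.92, 0.85], [0.87, 0.91]]
weights = train_som(toy)
neurons = [best_matching_unit(x, weights) for x in toy]
print(neurons)  # neuron index assigned to each genotype
```

Changing `rows`, `cols`, and `seed` mimics the procedure of testing several architectures with repeated training runs, since each run starts from different random weights.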

Results and Discussion
In the four experiments, there were effects of genotypes (hybrids, parents, and controls) for most traits, in both crop years, except for dry biomass in 2017/2018, which shows wide variability between genotypes (Table 2). Estimates of coefficients of variation (CV%) indicated an adequate experimental precision, in agreement with other studies on biomass sorghum (Silva et al., 2018; Oliveira et al., 2019). There was a significant effect of the genotype x environment interaction (G x E) for all traits, in both crop years, indicating that genotypes behaved differently in different environments. Thus, the genetic diversity between genotypes was studied separately for each environment, since the clustering pattern can change with environmental variation.
In Nova Porteirinha (first crop year, 2016/2017), the first two variables explained 92.94% of the total variation, according to the values associated with the canonical variables (Table 3). When the first two canonical variables explain over 80% of the total variation, their use is satisfactory in the study of genetic diversity, by assessing the graphic dispersion of scores in relation to the canonical variables CV1 and CV2 (Cruz et al., 2012). In Sete Lagoas (2016/2017), the first two variables explained 97.45% of the total variation. In the second crop year (2017/2018), the first three variables explained 94.36% of the total variation in Nova Porteirinha, and 94.71% of the total variation in Sete Lagoas. Based on these results, the two-dimensional graphic dispersion using CV1 and CV2 was made (Figure 1).
In all the evaluated experiments, the A lines were more divergent than the other lines (Figure 1). This can be explained by the fact that these lines are male-sterile, carry the bmr-6 allele, are short, and are insensitive to photoperiod. In contrast, R lines and hybrids were closer to each other, showing that R lines contribute a higher concentration of favorable and dominant alleles to the formation of hybrids. In addition, the R lines carry the bmr-6 allele, are tall, have excellent biomass yield, and are sensitive to photoperiod (Oliveira et al., 2019).
When transmitting their gametes, genotypes with a high concentration of dominant alleles leave more similar progeny than those with a high concentration of recessive alleles. In addition, dominant favorable alleles provide larger numbers of superior genotypes in the segregating generation, allowing the identification of larger individuals with a smaller population size. However, it is necessary to consider that plant height, in sorghum, is controlled by four independently inherited Dw (Dwarf) genes, namely Dw1, Dw2, Dw3, and Dw4 (Hilley et al., 2016; Shukla et al., 2017). Plant height is determined by the interplay of the internode length and the number of nodes that the plant produces before flowering. The Dw genes have partial dominance for tallness, and their effects are additive in nature (Shukla et al., 2017). There may also be recessive epistasis in the manifestation of the combined additive effect of two or more genes for the other evaluated traits. That is, the observed phenotype of hybrids may not be explained and, in this case, the combined effect of these genes on a phenotype cannot be predicted from the sum of their individual effects (Leite et al., 2020).
For self-organizing maps, the network architecture with five columns and five rows was found to be efficient for the four experiments (Figure 2). Several studies using the SOM network have also defined their topology either tentatively or randomly (Barbosa et al., 2011; Chaudhary et al., 2014; Gámez Albán et al., 2016; Santos et al., 2019). Therefore, the method for finding the best network architecture is very important because different results can be obtained each time SOM networks are used, since the networks have random synaptic weights at the beginning of the training (Santos et al., 2019).
The lines were organized in different neurons, which are represented by the hexagons. Filling in the area of the hexagons indicates the concentration of (Table 4), through the organization made by Kohonen's maps. The topology used was able to organize the genotypes in groups close to those obtained by the canonical variables (Figure 1). The SOM method proved to be efficient in identifying similarity patterns and in organizing the proximity of genotypes between groups, according to the grouping carried out, as shown by Santos et al. (2019), who used the SOM technique as an alternative method to evaluate the genetic diversity in plant breeding programs. However, in cases with higher numbers of neurons, there is a possibility of larger variation in the allocation of genotypes, as affirmed by Kohonen (2014). Thus, the use of artificial neural networks as pattern recognition methods is promising, and the SOM network can provide more valuable results than traditional cluster analysis.

Conclusion
The application of computational intelligence using a self-organizing map is promising and efficient for studies on the genetic diversity between biomass sorghum (Sorghum bicolor) genotypes.