SSR-based genetic analysis of sweet corn inbred lines using artificial neural networks

Studies on genetic diversity and population structure provide basic information at the molecular level, which is a key input for breeding programs of crop species. This study evaluated the genetic diversity of 12 elite lines of sweet corn, using 20 microsatellite markers. To determine the genetic differentiation among lines, we used an artificial neural network with the self-organizing map (SOM) algorithm. This algorithm identified three genetically differentiated groups and produced relatively more accurate results than UPGMA, according to the indices of Davies-Bouldin and RMSSTD (Root Mean Square Standard Deviation). The expected heterozygosity was high (He>0.5) for 90% and the polymorphism information content high (PIC>0.6) for 40% of the SSR loci, indicating their potential to detect genetic differences among lines. The high genetic differentiation, detected by the neural network procedure, would allow the selection of promising divergent sweet corn genotypes.


INTRODUCTION
Maize (Zea mays L.) is the most cultivated cereal in the world, due to its importance for human and animal nutrition. In addition, corn has many industrial applications, such as the production of ethanol, oil and high-amylose starch. Consequently, several breeding programs have been undertaken to achieve genetic gains in traits of interest (e.g., Kulka et al. 2018). Studies of diversity and genetic structure allow plant breeders to investigate population variability and thus provide basic information at the molecular level (Ballesta et al. 2015, Mora et al. 2015, Contreras-Soto et al. 2017), which is a key aspect of maize breeding programs (Saavedra et al. 2013, Amaral et al. 2016).
Information on population structure and genetic diversity provides crucial inputs for the breeding of crop species, including corn. Microsatellites (simple sequence repeats, SSR) are a key marker type in genetic studies because of their high levels of polymorphism and their large number of alleles per locus (Mora et al. 2017). In corn, SSR markers have been used in the analysis of genetic diversity (Lopes et al. 2015), in studies of population structure and in the mapping of quantitative trait loci.
Analyses of genetic clustering based on molecular markers are a simple and powerful tool to determine population structure (Peña-Malavera et al. 2014). Currently, the main clustering techniques use Markov Chain Monte Carlo (MCMC) algorithms to fit a model to molecular data. For instance, Pritchard et al. (2000) proposed a Bayesian clustering method, which assumes that the populations are in Hardy-Weinberg equilibrium (HWE). On the other hand, Gao et al. (2007) proposed an alternative method to analyze population structure that relaxes the assumption of HWE in the underlying populations. Other methods have been used as alternatives to MCMC, such as principal component analysis and artificial neural network analysis (ANN; e.g., Barbosa et al. 2011). The objectives of this study were to examine the genetic diversity of 12 elite sweet corn lines using 20 microsatellite markers and to determine their genetic differentiation using ANN.

MATERIAL AND METHODS
Twelve parental lines of sweet corn from an elite line group of the Maringá State University and Syngenta Seeds Ltd. (Werle et al. 2014) were genetically evaluated with 20 SSR markers obtained from the Maize Genetic Data Bank (http://www.maizegdb.org) (Table 1). Genomic DNA was extracted from young leaves collected in agricultural fields in Cascavel and Mauá da Serra, Paraná State, Brazil, using the protocol described by Gawel and Jarret (1991) with minor modifications. Polymerase chain reaction (PCR) amplification was performed with the Touchdown PCR program (Don et al. 1991), in reaction volumes of 20 µL containing 25 ng of DNA, 2.0 µL of 10× reaction buffer, 2.5 mM MgCl2, 0.1 mM of each dNTP (dATP, dGTP, dCTP and dTTP), 0.3 mL of each primer (F and R primers) and 1 U Taq-DNA-Polymerase (Invitrogen). After amplification, 20 µL per sample (a total of 120 aliquots) were separated by electrophoresis on 10% (w/v) denaturing polyacrylamide gel. All 120 samples amplified per SSR primer were run with 1X TBE at 80 V for 18 h. A low range DNA ladder (Thermo Scientific) was used as the molecular weight reference. Gels were visualized under an ultraviolet transilluminator and photographed using the Kodak 1D 3.5 program. The number of alleles per locus was determined based on the relative position of the bands on the polyacrylamide gel.
Population differentiation was inferred with an ANN approach based on the Self-Organizing Map algorithm (SOM, Kohonen 1998). Additionally, SOM results were compared with: 1) principal coordinate analysis (PCoA) implemented in GenAlEx 6.5; and 2) the Unweighted Pair Group Method with Arithmetic Mean (UPGMA), according to the default  (2015). Nei's genetic distances between inbred lines were used to create the UPGMA dendrogram, and the reliability of the dendrogram was assessed by bootstrapping. The allelic frequencies (calculated in GenAlEx) were used to start the learning process of the SOM.
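As an illustration of the distance that feeds the UPGMA dendrogram, the sketch below computes a genetic distance between two lines from their per-locus allele frequencies. It assumes Nei's (1972) standard distance, D = -ln(Jxy / sqrt(Jx Jy)); the text does not state which of Nei's distances was used, and the function name, the data layout and the allele sizes in the example are hypothetical.

```python
import math

def nei_standard_distance(freqs_x, freqs_y):
    """Nei's (1972) standard genetic distance between two lines (assumed variant).

    freqs_x, freqs_y: one dict per locus, mapping allele -> frequency.
    """
    jx = jy = jxy = 0.0
    for locus_x, locus_y in zip(freqs_x, freqs_y):
        alleles = set(locus_x) | set(locus_y)
        # Sum homozygosities (jx, jy) and the cross-product (jxy) over loci;
        # dividing each by the number of loci would cancel in the ratio below.
        jx += sum(locus_x.get(a, 0.0) ** 2 for a in alleles)
        jy += sum(locus_y.get(a, 0.0) ** 2 for a in alleles)
        jxy += sum(locus_x.get(a, 0.0) * locus_y.get(a, 0.0) for a in alleles)
    identity = jxy / math.sqrt(jx * jy)
    # Lines sharing no allele at any locus have zero identity (infinite distance).
    return -math.log(identity) if identity > 0 else math.inf

# Hypothetical example: two inbred lines fixed for the same allele at the
# first locus and for different alleles at the second locus.
line_a = [{"150": 1.0}, {"200": 1.0}]
line_b = [{"150": 1.0}, {"210": 1.0}]
distance_ab = nei_standard_distance(line_a, line_b)
```

In this toy case the genetic identity is 0.5, so the distance equals ln 2. A matrix of such pairwise distances is what a UPGMA routine would take as input.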
The SOM is an unsupervised learning algorithm able to reduce very high dimensional data into patterns that can be usefully interpreted (Kohonen 1998). This method consists of two layers of artificial neurons (or nodes): an input layer (data) with "p" = 1, 2, …, "r" (one for each molecular marker) and an output layer consisting of a two-dimensional map with "a" neurons, established in a hexagonal grid (Paini et al. 2010). The procedure implemented in this study can be summarized in the following steps: (A) Starting weight vectors "w", taking random values from the input vectors "p".

Genetic diversity of markers
The microsatellites used in this study yielded 228 alleles, with a mean value of 3.85 alleles per locus. The number of alleles per locus ranged from two to six. Only 5% of all marker data were missing due to amplification failure or null alleles. The mean observed heterozygosity (Ho) across the SSR loci was low (0.088), a result expected in pure lines, and reached a maximum at locus umc1549 (0.75). The genetic diversity was equivalent to the expected heterozygosity for diploid data, which is defined as the probability that two randomly chosen haplotypes (alleles) in the sample are different (Sserumaga et al. 2014). The expected heterozygosity (He) ranged from 0.279 (umc2292) to 0.806 (bnlg1371), with an average of 0.637 (Table 1). These values were significantly correlated with the number of alleles (r = 0.764), and the highest He values were found at two loci with six alleles (bnlg1371 and umc1137). Ninety percent of the loci had high He (>0.5), indicating their adequacy to differentiate sweet corn inbred lines.
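The two heterozygosity measures discussed above can be sketched as follows. He is computed as 1 minus the sum of squared allele frequencies, which matches the definition given in the text (the probability that two randomly drawn alleles differ); the text does not say whether a small-sample correction was applied, so none is used here. The function names and the genotype data are hypothetical.

```python
from collections import Counter

def allele_frequencies(genotypes):
    """Allele frequencies at one locus from diploid genotypes (pairs of alleles)."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}

def expected_heterozygosity(freqs):
    """He = 1 - sum(p_i^2): chance that two randomly drawn alleles differ."""
    return 1.0 - sum(p * p for p in freqs.values())

def observed_heterozygosity(genotypes):
    """Ho = proportion of genotypes carrying two different alleles."""
    return sum(1 for a, b in genotypes if a != b) / len(genotypes)

# Hypothetical locus scored in four lines (allele sizes in base pairs).
locus = [(150, 150), (150, 160), (160, 160), (170, 170)]
he = expected_heterozygosity(allele_frequencies(locus))
ho = observed_heterozygosity(locus)
```

For this toy locus, He is 0.65625 and Ho is 0.25. A wide gap between the two, with Ho much lower than He, is the pattern expected for largely homozygous inbred lines.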
A high level of differentiation was found among the 12 sweet corn lines (Fst = 0.897, Table 1), where the Fst per locus ranged from 0.431 (umc1549) to 0.998 (mmc0181, bnlg2191, umc1636 and umc2308). The Fst values indicated that 89.7% of the total variation in the locus allele frequency was due to genetic differences among the lines under study (Eloi et al. 2012). The genetic diversity found in this study was similar to that reported by Eloi et al. (2012) and Sserumaga et al. (2014). Forty percent of the loci had high PIC values (>0.6), indicating a great potential to detect differences among pure lines, confirming previous studies (Sserumaga et al. 2014).
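For reference, a minimal sketch of how the polymorphism information content of a locus can be computed from its allele frequencies. It assumes the commonly used formula of Botstein et al. (1980), PIC = 1 - Σ p_i² - Σ_{i<j} 2 p_i² p_j²; the text does not state which PIC formula was applied, and the function name and example frequencies are hypothetical.

```python
from itertools import combinations

def polymorphism_information_content(freqs):
    """PIC of one locus from a list of allele frequencies (Botstein-type formula, assumed)."""
    homozygosity = sum(p * p for p in freqs)
    # Correction term summed over all unordered pairs of distinct alleles.
    pair_term = sum(2.0 * (p * p) * (q * q) for p, q in combinations(freqs, 2))
    return 1.0 - homozygosity - pair_term

# Hypothetical loci with two and four equally frequent alleles.
pic_two_alleles = polymorphism_information_content([0.5, 0.5])
pic_four_alleles = polymorphism_information_content([0.25, 0.25, 0.25, 0.25])
```

Under this formula, a locus with two equally frequent alleles gives PIC = 0.375 and one with four equally frequent alleles gives 0.703125, so only loci with several reasonably balanced alleles can exceed the 0.6 threshold used above.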

Population structure analysis
The SOM algorithm, based on the allelic frequency of SSR loci, showed that the 12 parent lines formed three genetically differentiated groups (Figure 1). The first group included half of the lines (6/12), and the second and third groups contained three lines each. Similar to the findings reported by Kohonen (1998), the clustering results from the neural network agreed with those of the PCoA analysis (Figure 2). Groups 1 and 2 in the UPGMA dendrogram were strongly supported by bootstrapping, while group 3 had low bootstrap values, indicating a low level of confidence. Lines L1 and L5 were grouped differently by UPGMA (Figure 3) than by the former two methods. However, the RMSSTD value for SOM clustering was slightly lower than that of UPGMA (0.36 and 0.38, respectively), indicating higher homogeneity within the SOM clusters. Similarly, the DB index was higher for the UPGMA method (DB = 1.93) than for the SOM algorithm (DB = 1.82), indicating lower precision of UPGMA. These findings agree with a previous study by Peña-Malavera et al. (2014), who reported higher error rates for the UPGMA method than for SOM and PCoA, because UPGMA produces highly unbalanced clusters.
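The two validity indices compared above can be sketched as follows, to make explicit why lower values indicate better clustering. This is a plain-Python illustration under common textbook definitions (RMSSTD as the pooled within-cluster standard deviation, and the Davies-Bouldin index with mean Euclidean distance to the centroid as cluster scatter); the functions used in the study may differ in detail. All names and data are hypothetical, and every cluster is assumed to contain at least two points.

```python
import math

def centroid(points):
    return [sum(column) / len(points) for column in zip(*points)]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def rmsstd(clusters):
    """Pooled within-cluster standard deviation (lower = more homogeneous)."""
    dims = len(clusters[0][0])
    sum_squares, dof = 0.0, 0
    for points in clusters:
        c = centroid(points)
        sum_squares += sum(euclidean(p, c) ** 2 for p in points)
        dof += dims * (len(points) - 1)
    return math.sqrt(sum_squares / dof)

def davies_bouldin(clusters):
    """Mean, over clusters, of the worst (scatter_i + scatter_j) / centroid distance."""
    cents = [centroid(points) for points in clusters]
    scatter = [sum(euclidean(p, c) for p in points) / len(points)
               for points, c in zip(clusters, cents)]
    k = len(clusters)
    worst = [max((scatter[i] + scatter[j]) / euclidean(cents[i], cents[j])
                 for j in range(k) if j != i)
             for i in range(k)]
    return sum(worst) / k

# Hypothetical partitions: compact, well-separated clusters versus looser ones.
tight = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
loose = [[(0, 0), (0, 6)], [(4, 0), (4, 6)]]
tight_scores = (rmsstd(tight), davies_bouldin(tight))
loose_scores = (rmsstd(loose), davies_bouldin(loose))
```

Here the compact partition scores RMSSTD = 1.0 and DB = 0.2, against 3.0 and 1.5 for the looser one, which mirrors the direction of the SOM versus UPGMA comparison reported in the text.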
Clustering analysis using neural networks (via SOM) offers a faster alternative for identifying genetic clusters than MCMC methods, as highlighted by Nikolic et al. (2009). As found in previous studies (Barbosa et al. 2011, Peña-Malavera et al. 2014), neural networks adapt well to multi-allelic data and provide precise results in the identification of genetically differentiated groups. Finally, the high genetic differentiation detected among maize lines would allow the selection of promising divergent genotypes in the current sweet corn breeding program.
(B) Calculating the Euclidean distance between p and w. (C) Assigning each p to the closest w, based on the distance results. (D) Updating w from the assigned p. (E) Repeating steps B, C and D until convergence is achieved (Kohonen 1998). The Root Mean Square Standard Deviation (RMSSTD, Grover and Vriens 2006) and the Davies-Bouldin index (DB; Davies and Bouldin 1979), both computed using functions from the clusterSim package of the R project, were used to test the accuracy of the procedure.
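Steps A to E can be sketched as a small batch SOM in plain Python. This is an illustrative simplification rather than the procedure used in the study: it assumes a rectangular 1 × 3 grid instead of a hexagonal map, a Gaussian neighbourhood whose width shrinks linearly, and initial weights drawn from the distinct input vectors. All function names, parameters and data are hypothetical.

```python
import math
import random

def _dist2(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def _closest_node(p, weights):
    # Steps B and C: squared Euclidean distance to every weight vector, keep the nearest.
    return min(range(len(weights)), key=lambda k: _dist2(p, weights[k]))

def train_som(data, rows=1, cols=3, n_iter=50, sigma0=1.0, tol=1e-9, seed=0):
    """Batch SOM sketch; returns (weights, node index assigned to each input)."""
    rng = random.Random(seed)
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    dim = len(data[0])
    # Step A: start the weights from randomly chosen (distinct) input vectors.
    unique = sorted(set(tuple(p) for p in data))
    if len(unique) >= len(grid):
        picks = rng.sample(unique, len(grid))
    else:
        picks = [rng.choice(unique) for _ in grid]
    weights = [list(p) for p in picks]
    for t in range(n_iter):
        sigma = max(sigma0 * (1.0 - t / n_iter), 0.01)  # assumed linear decay
        assigned = [_closest_node(p, weights) for p in data]
        # Step D: each weight becomes a neighbourhood-weighted mean of the inputs.
        new_weights = []
        for k, node in enumerate(grid):
            num, den = [0.0] * dim, 0.0
            for p, b in zip(data, assigned):
                h = math.exp(-_dist2(node, grid[b]) / (2.0 * sigma ** 2))
                den += h
                for d in range(dim):
                    num[d] += h * p[d]
            new_weights.append([x / den for x in num] if den > 0 else weights[k])
        # Step E: stop once the weights no longer move.
        shift = max(_dist2(w, nw) for w, nw in zip(weights, new_weights))
        weights = new_weights
        if shift < tol:
            break
    return weights, [_closest_node(p, weights) for p in data]

# Hypothetical allele-frequency vectors: six inbred lines, each fixed for one
# of three alleles at a single locus.
lines = [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
         (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
weights, labels = train_som(lines)
```

In this toy case each pair of identical lines maps to its own node, giving three groups. With real SSR data the input vectors would hold the allele frequencies across all loci, and the outcome can depend on the grid size and on the random start.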

Figure 1. Results from the clustering procedure based on artificial neural network analysis of 20 sweet corn inbred lines, which evidenced three genetically differentiated groups.

Figure 3. Cluster dendrogram of 20 sweet corn inbred lines, evaluated with SSR markers. The tree was constructed using the unweighted pair group method with arithmetic average (UPGMA) based on Nei's genetic distances. Values at the nodes indicate the percentage of 10,000 bootstrap runs supporting a particular node.

Figure 2. Principal coordinate analysis of genetic distances of 20 sweet corn inbred lines. Symbols denote inbred lines belonging to a particular genetically differentiated cluster, according to the artificial neural network model.

Table 1. SSR locus information, number of alleles per locus, expected heterozygosity (He), polymorphic information content (PIC) and coefficient of fixation (Fst) in 12 sweet corn inbred lines evaluated by microsatellite markers.

Felsenstein 1989). Principal coordinate analysis was based on standardized covariance of genetic distances calculated for codominant markers (option DISTANCE, sub-option GENETIC), according to Mora et al.