Genetic diversity and heterotic grouping of sorghum lines using SNP markers

Sorghum breeding programs are based predominantly on developing homozygous lines to produce single-cross hybrids, frequently with relatively narrow genetic bases. The adoption of complementary strategies, such as genetic diversity studies, enables a broader view of the genetic structure of the breeding germplasm. The purpose of this study was to evaluate the genetic diversity of sorghum breeding lines using population structure, principal component (PC) and clustering analyses. A total of 160 sorghum lines were genotyped with 29,649 SNP markers generated by genotyping-by-sequencing (GBS). The PC and clustering analyses consistently divided the R (restorer) and B (maintainer) lines based on their pedigree, generating four groups. Thirty-two B and 21 R lines were used to generate 121 single-cross hybrids, whose performances were compared based on the diversity clustering of each parental line. The genetic divergence of B and R lines indicated a potential for increasing heterotic response in the development of hybrids. The genetic distance was correlated with heterosis, allowing for the use of markers to create heterotic groups in sorghum.


Introduction
Sorghum bicolor (L.) Moench is the fifth most important cereal cultivated worldwide after corn, rice, wheat, and barley, and is mainly grown in semiarid tropical regions for food and fodder (FAO, 2019). The genus Sorghum presents broad genetic diversity, including wild and cultivated species, divided into five basic morphological races (bicolor, caudatum, durra, guinea and kafir) and into ten other intermediate races, which are various combinations involving the five basic races (Harlan and De Wet, 1972).
Hybrid production in sorghum relies on the cytoplasmic-genetic male sterility (CMS) system. The A-line (female) is a male-sterile line in A1 cytoplasm, generated by backcrossing a maintainer line (called B-line) in normal cytoplasm, so that the female parent carries the maternally inherited A1 cytoplasm. Restorer lines (R-lines) carry dominant nuclear genes that restore the fertility of hybrids with the A1 cytoplasm. The fertile F1 hybrid is produced by crossing A- with R-lines (Jordan et al., 2010, 2011; Klein et al., 2008; Mindaye et al., 2015). Thus, A/B- and R-lines are used to differentiate the parental pools in a sorghum breeding program (Menz et al., 2004; Mindaye et al., 2015).
Detailed analyses using molecular markers indicated the existence of genetic relationships between elite parental lines of sorghum (Jordan et al., 2010, 2011; Menz et al., 2004). Several molecular markers are used to explore genetic diversity in sorghum, including single nucleotide polymorphisms (SNPs) (Elangovan et al., 2014; Geleta et al., 2006; Billot et al., 2013; Lekgari and Dweikat, 2014; Silva et al., 2017). SNP markers have a number of advantages such as abundance along the genome and potential for high throughput analysis (Varshney et al., 2009).
Genetic distance estimates are important because they help assign genotypes to heterotic groups for hybrid development from intergroup crosses (Brown et al., 2011; Ramu et al., 2013). Information on genetic diversity and heterotic groups helps breeders both to develop inbred lines and to use their germplasm more efficiently and consistently, by exploiting complementary lines that maximize the outcome of hybrid breeding programs (Mindaye et al., 2015). In this context, the purpose of this study was to assess the genetic diversity of grain sorghum B- and R-lines, and to estimate correlations between the genetic distance of lines and the magnitude of heterosis in hybrids.

Genetic material
A total of 160 grain sorghum lines were selected based on days to flowering, plant height and resistance to biotic and abiotic stresses. These genotypes are used as elite lines of the sorghum breeding program, including 109 restorer lines (R-lines) and 51 maintainer lines (B-lines). The experiments were conducted in Sete Lagoas, Minas Gerais, Brazil (19°27'57'' S, 44°14'48'' W, altitude of 751 m).

Molecular marker data
Genomic DNA was extracted from the young leaves of one plant representing each inbred line using the cetyl trimethylammonium bromide method (Saghai-Maroof et al., 1984). DNA quality and quantity were evaluated on a gel run in Tris-acetate-EDTA buffer and stained with GelRed, and with a fluorometer. Genotyping-by-sequencing (GBS) (Elshire et al., 2011) was performed at Ithaca, NY, USA, with the restriction enzyme ApeKI and 96 samples per sequencing lane.

Genetics and Plant Breeding. Research Article. Genetic diversity in grain sorghum. Sci. Agric. v.78, n.6, e20200039, 2021
SNPs were called using the GBS pipeline, available in the TASSEL software program (version 5.0). Subsequently, SNP markers were filtered for the minor allele frequency (MAF) ≥ 5 %, missing genotypes ≤ 20 %, and for a proportion of heterozygotes per locus below 5 %.
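The three filters described above can be illustrated with a minimal sketch. It is not the TASSEL implementation: the genotype coding (0, 1, 2 copies of the alternate allele, `None` for missing) and the function name `passes_filters` are assumptions made for illustration only.

```python
from typing import List, Optional

# Assumed coding: alternate-allele count 0, 1 (heterozygote) or 2; None = missing call.
Genotype = Optional[int]


def passes_filters(calls: List[Genotype],
                   min_maf: float = 0.05,
                   max_missing: float = 0.20,
                   max_het: float = 0.05) -> bool:
    """Return True if one SNP (its calls across all lines) passes the three filters."""
    n_total = len(calls)
    observed = [g for g in calls if g is not None]
    if n_total == 0 or not observed:
        return False

    # Filter 1: at most 20 % missing genotypes.
    missing_rate = 1 - len(observed) / n_total
    if missing_rate > max_missing:
        return False

    # Filter 2: proportion of heterozygotes strictly below 5 %.
    het_rate = sum(1 for g in observed if g == 1) / len(observed)
    if het_rate >= max_het:
        return False

    # Filter 3: minor allele frequency of at least 5 %.
    alt_freq = sum(observed) / (2 * len(observed))
    maf = min(alt_freq, 1 - alt_freq)
    return maf >= min_maf
```

Applying such a function to every locus and keeping those that return `True` mirrors the logic of the filtering step, under the stated coding assumption.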

Analysis of genetic differentiation in sorghum lines
Diversity analysis was conducted using SNP data for the 160 sorghum lines. For each SNP, the minor allele frequency and the polymorphism information content (PIC) were calculated. PIC reports the discriminatory power of a marker by considering not only the number of alleles per locus but also their relative frequencies (Botstein et al., 1980), and is expressed by:

$$\mathrm{PIC} = 1 - \sum_{i=1}^{l} p_i^2 - \sum_{i=1}^{l-1} \sum_{j=i+1}^{l} 2 p_i^2 p_j^2$$

where $l$ is the number of alleles per locus, and $p_i$ and $p_j$ are the estimated frequencies of the $i$-th and $j$-th alleles, respectively. Both quantities were calculated using TASSEL (version 5.0).
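As an illustration of the PIC of Botstein et al. (1980), the sketch below computes PIC = 1 − Σ p_i² − Σ_{i<j} 2 p_i² p_j² from a list of allele frequencies. The function name `pic` is hypothetical, and the code is not the TASSEL routine used in the study.

```python
from itertools import combinations
from typing import Sequence


def pic(freqs: Sequence[float]) -> float:
    """Polymorphism information content for one locus.

    freqs holds the allele frequencies p_1..p_l, which should sum to 1.
    """
    # Sum of squared allele frequencies (expected homozygosity).
    homozygosity = sum(p * p for p in freqs)
    # Correction term summed over all unordered pairs of distinct alleles.
    correction = sum(2 * (p_i ** 2) * (p_j ** 2)
                     for p_i, p_j in combinations(freqs, 2))
    return 1 - homozygosity - correction


# A biallelic SNP with equal allele frequencies gives the maximum value.
balanced_snp_pic = pic([0.5, 0.5])
```

For a biallelic marker the maximum is 0.375, reached at equal allele frequencies, which agrees with the upper end of the PIC range reported for the SNPs in this study.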
An analysis of the population structure was conducted for the identified groups of sorghum lines using the Bayesian model-based clustering algorithm in the STRUCTURE software program (version 2.2). The admixture model with correlated allele frequencies was used, assuming that each line may share regions of the genome with more than one group (Falush et al., 2003). The model was run with a burn-in period of $1 \times 10^{4}$ followed by $1 \times 10^{4}$ Markov Chain Monte Carlo (MCMC) replicates, with ten iterations for each number of subpopulations (k = 1 to 10). The number of subpopulations (k) was determined from the estimated logarithm of likelihood Ln P(D) for each k, where the lower variance between runs was considered as the appropriate value (Casa et al., 2008), and from the second-order rate of change in likelihood (ΔK) (Evanno et al., 2005).
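The ΔK criterion can be sketched as follows. This is one common way of computing it, assuming the second-order difference is taken on the run means of Ln P(D) and divided by the standard deviation across runs at that k. The function name `delta_k` and the numbers in `example_lnp` are hypothetical and are not results from this study.

```python
from statistics import mean, stdev
from typing import Dict, List


def delta_k(lnp: Dict[int, List[float]]) -> Dict[int, float]:
    """Delta K for every k that has both neighbours (k - 1 and k + 1).

    lnp maps k to the Ln P(D) values of its replicate runs.
    """
    means = {k: mean(values) for k, values in lnp.items()}
    result: Dict[int, float] = {}
    for k in sorted(lnp):
        if k - 1 not in lnp or k + 1 not in lnp or len(lnp[k]) < 2:
            continue
        # Absolute second-order rate of change of the mean log likelihood.
        second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
        sd = stdev(lnp[k])  # sample standard deviation across runs
        if sd > 0:
            result[k] = second_diff / sd
    return result


# Hypothetical Ln P(D) values for two runs per k (illustration only).
example_lnp = {1: [-1000.0, -1002.0], 2: [-800.0, -802.0],
               3: [-790.0, -796.0], 4: [-785.0, -795.0]}
example_dk = delta_k(example_lnp)
best_k = max(example_dk, key=example_dk.get)
```

Because ΔK needs both neighbours, it is undefined for the smallest and largest k tested, so k = 1 cannot be selected by this criterion.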
Principal component analysis (PCA) was conducted based on a marker-based similarity matrix, using TASSEL (version 5.0). The plots were produced with the ggplot2 package (Wickham, 2016) in R.
The genetic distances between pairs of sorghum lines were calculated based on the SNP data using the identical-by-state (IBS) coefficient (Powell et al., 2010) with TASSEL (version 5.0). Clustering was performed using the Neighbor-Joining method (Saitou and Nei, 1987) in PowerMarker (version 3.25). The tree was drawn using the ggtree package (Yu et al., 2017) in R.
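A minimal sketch of an IBS-based distance is given below. It uses the allele-sharing form, in which the distance at a locus is half the absolute difference in alternate-allele counts, averaged over loci genotyped in both lines. This is a simplification and may differ from the TASSEL computation for heterozygous calls; for nearly homozygous inbred lines it reduces to the proportion of loci at which two lines differ. The genotype coding and function names are assumptions.

```python
from typing import List, Optional, Sequence

# Assumed coding: alternate-allele count 0, 1 or 2; None = missing call.
Genotype = Optional[int]


def ibs_distance(a: Sequence[Genotype], b: Sequence[Genotype]) -> float:
    """Allele-sharing distance between two lines (0 = identical, 1 = no shared alleles)."""
    diffs = [abs(x - y) / 2 for x, y in zip(a, b)
             if x is not None and y is not None]
    if not diffs:
        raise ValueError("no loci genotyped in both lines")
    return sum(diffs) / len(diffs)


def distance_matrix(lines: List[Sequence[Genotype]]) -> List[List[float]]:
    """Symmetric matrix of pairwise distances, with zeros on the diagonal."""
    n = len(lines)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = ibs_distance(lines[i], lines[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix
```

A matrix of this kind is the usual input to a Neighbor-Joining routine, which is not reproduced here.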

Phenotypic data of hybrids
Grain yield data of 121 grain sorghum hybrids derived from 32 maintainer lines and 21 restorer lines genotyped in this study were selected from eight trials evaluated in Sete Lagoas, MG, Brazil, over three years (2015, 2016 and 2017). The trials were laid out in a randomized complete block design with two replications (five trials) or three replications (three trials). Each plot consisted of two 5.0 m rows, with 0.50 m between rows, in all eight trials. Grain yield was determined by weighing all grains in each plot, adjusted to 13 % grain moisture and converted into tons per hectare (t ha⁻¹).
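The conversion from plot weight to t ha⁻¹ can be sketched as follows. The paper does not state its formula, so the code assumes the usual dry-matter adjustment, weight × (100 − measured moisture) / (100 − 13), and a harvested area equal to the full plot (2 rows × 5.0 m × 0.50 m = 5.0 m²). The function name and the example inputs are hypothetical.

```python
PLOT_AREA_M2 = 2 * 5.0 * 0.50   # two 5.0 m rows spaced 0.50 m apart = 5.0 m^2
TARGET_MOISTURE = 13.0          # percent


def yield_t_ha(plot_kg: float, moisture_pct: float,
               area_m2: float = PLOT_AREA_M2) -> float:
    """Grain yield in t/ha, adjusted to 13 % grain moisture."""
    # Rescale the plot weight so that it corresponds to grain at 13 % moisture.
    adjusted_kg = plot_kg * (100 - moisture_pct) / (100 - TARGET_MOISTURE)
    # kg per m^2 -> t per ha: multiply by 10,000 m^2/ha and divide by 1,000 kg/t.
    return adjusted_kg / area_m2 * 10


# Hypothetical plot: 2.5 kg of grain harvested at 15 % moisture.
example_yield = yield_t_ha(2.5, 15.0)
```

Under these assumptions, each kilogram of adjusted grain per plot corresponds to 2 t ha⁻¹.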
First, the grain yield data of the hybrids were fitted using a mixed model in which $Y$ is the phenotypic value of the $i$-th genotype ($i = 1, \ldots, I$) from block $j$ ($j = 1, \ldots, J$), in replicate $k$ ($k = 1, \ldots, K$), in trial $l$ ($l = 1, \ldots, L$); $\mu$ is the general mean; $g_i$ is the fixed effect of the $i$-th genotype; and $b_j$ is the random effect of …

Next, the means of the B and R lines were estimated using the following mixed model:

$$Y_{ijlmn} = \mu + g_i + r_j + t_l + LR_m + LB_n + g_{il} + LR_{ml} + LB_{nl} + \varepsilon_{ijlmn}$$

where $Y_{ijlmn}$ is the phenotypic value of the $i$-th genotype ($i = 1, \ldots, I$) from the $j$-th block ($j = 1, \ldots, J$), in trial $l$ ($l = 1, \ldots, L$), for restorer line $m$ ($m = 1, \ldots, M$) and maintainer line $n$ ($n = 1, \ldots, N$); $\mu$ is the general mean; $g_i$ is the fixed effect of the $i$-th genotype; $r_j$ is the random effect of the $j$-th replicate within trial $l$, where $r_j \sim N(0, \sigma_r^2)$, with $\sigma_r^2$ the variance of replicates within trials; $t_l$ is the random effect of the $l$-th trial, where $t_l \sim N(0, \sigma_t^2)$, with $\sigma_t^2$ the variance of trials; $LR_m$ is the fixed effect of the $m$-th restorer line; $LB_n$ is the fixed effect of the $n$-th maintainer line; $g_{il}$ is the random effect of the $i$-th genotype within trial $l$, where $g_{il} \sim N(0, \sigma_g^2)$, with $\sigma_g^2$ the variance of genotypes within trials; $LR_{ml}$ is the random effect of the $m$-th restorer line within trial $l$, where $LR_{ml} \sim N(0, \sigma_{LR}^2)$, with $\sigma_{LR}^2$ the variance of restorer lines within trials; $LB_{nl}$ is the random effect of the $n$-th maintainer line within trial $l$, where $LB_{nl} \sim N(0, \sigma_{LB}^2)$, with $\sigma_{LB}^2$ the variance of maintainer lines within trials; and $\varepsilon_{ijlmn}$ is the residual effect, where $\varepsilon \sim N(0, \sigma_\varepsilon^2)$, with $\sigma_\varepsilon^2$ the residual variance.

For both models, the adjusted means of each hybrid and line were obtained via Best Linear Unbiased Estimator (BLUE) using the ASReml-R statistical package v.3 (Butler et al., 2009) in R (R Core Team, version 3.2.5).

Molecular markers
Genotyping-by-sequencing (GBS) generated 86,342 single nucleotide polymorphism (SNP) markers dispersed along the ten sorghum chromosomes. Filtering for an MAF of at least 5 % and a maximum of 20 % missing data per locus resulted in 29,649 polymorphic SNPs.

Genetic diversity
PIC values of individual SNPs ranged from 0.07 to 0.38 with an average of 0.24. The PIC for B-lines was 0.20 and for R-lines was 0.25. Population structure analysis using the criteria proposed by Evanno et al. (2005) indicated an optimum value of k = 2 (Figure 1A). This clustering was consistent with the classification of lines into restorer (R) or maintainer (B) (Figure 1B).
The population structure revealed by the principal component analysis (PCA) based on the SNP markers was also consistent with the pedigree data of the sorghum lines. The first (PC1) and second (PC2) principal components explained 19 % and 7 %, respectively, of the genetic variability observed in the sorghum lines, and separated the lines into four groups, G1 to G4 (Figure 2). The G1 group included R-lines derived from BRP5BR, SC748 and SC326-6. The G2 group was mainly formed by R-lines derived from BRP5BR, BRP3BR, SC326, TX2536 and SC170. G3 included mainly B- and R-lines derived from SC748, SC326, SC170 and ATF54. Finally, the G4 group clustered B-lines derived from ATF54, ARG1 and TX623 (Figure 2). The G1 and G4 groups were more clearly defined than G2 and G3.
The mean genetic dissimilarity between pairs of lines was 0.33, ranging from 0.012 to 0.46. The Neighbor-Joining dendrogram revealed a more detailed relationship between the lines, consistent with the pedigree data (Figure 3). The degree of relatedness between lines can be viewed in the kinship heatmap, which further supported the four clusters, plus a set of mixed lines (Figure 4). However, a few discrepancies were observed between the PCA and the Neighbor-Joining dendrogram. The clusters formed pedigree-consistent groups, which resulted from crosses among inbred lines generated in the hybrid breeding program.

Hybrid performance and magnitude of heterosis for different groups of hybrids
The genetic distance between the parental inbred lines was significantly correlated with the grain yield (r = 0.32), heterosis (r = 0.20) and heterobeltiosis (r = 0.21) of the derived hybrids. The correlations of grain yield with heterosis (r = 0.82) and heterobeltiosis (r = 0.96) were significant and high (Figure 5).
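For readers who wish to reproduce this kind of analysis, the sketch below computes a Pearson correlation coefficient and the corresponding t statistic, t = r √(n − 2) / √(1 − r²). The paper does not state which significance test was applied, so the t test is an assumption, as are the function names.

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def t_statistic(r: float, n: int) -> float:
    """t statistic for H0: no correlation, with n - 2 degrees of freedom."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)


# With the 121 hybrids of this study, r = 0.20 gives t of about 2.23.
t_heterosis = t_statistic(0.20, 121)
```

With 119 degrees of freedom, a t value of about 2.23 exceeds the two-sided 5 % critical value of roughly 1.98, which is consistent with the weaker correlations being reported as significant.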
The average grain yield for the 121 hybrids was 4.28 t ha⁻¹, whereas the predicted means for the female and male parents were 4.12 and 4.07 t ha⁻¹, respectively. Approximately 60 % of all hybrids evaluated presented positive heterosis (H) and 47 % presented positive heterobeltiosis (best-parent heterosis, BPH) for grain yield. However, the magnitude of both indices varied widely: H ranged from -43 % to 73 % and BPH from -45 % to 63 %.
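The two indices can be illustrated with their standard definitions, which are assumed here because the formulas are not given in the text: mid-parent heterosis H = 100 × (F1 − MP) / MP, where MP is the mean of the two parents, and heterobeltiosis BPH = 100 × (F1 − BP) / BP, where BP is the better parent. The function names are hypothetical.

```python
def mid_parent_heterosis(f1: float, parent_1: float, parent_2: float) -> float:
    """Heterosis (%) of a hybrid relative to the mean of its two parents."""
    mid_parent = (parent_1 + parent_2) / 2
    return 100 * (f1 - mid_parent) / mid_parent


def heterobeltiosis(f1: float, parent_1: float, parent_2: float) -> float:
    """Heterosis (%) of a hybrid relative to its better (higher-yielding) parent."""
    best_parent = max(parent_1, parent_2)
    return 100 * (f1 - best_parent) / best_parent


# Applied to the overall means reported above (4.28, 4.12 and 4.07 t/ha).
h_of_means = mid_parent_heterosis(4.28, 4.12, 4.07)
bph_of_means = heterobeltiosis(4.28, 4.12, 4.07)
```

Applied to the overall means, these definitions give roughly 4.5 % and 3.9 %. These are indices of the means rather than means of the per-hybrid indices, so they illustrate the arithmetic and should not be read as results of the study.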
The hybrids were grouped according to the allocation of their parental lines to one of the four groups defined by PCA (Figure 6). Because Group 3 (G3) contained both B- and R-lines, it also generated hybrids between lines of the same group. However, the grain yield of these hybrids was similar to that of hybrids between lines from different groups, such as G1 × G2.
[Figure caption, Neighbor-Joining dendrogram] Genetic distances between sorghum lines were calculated using the identity-by-state (IBS) coefficient. The branch colors follow the groups obtained with the first two principal components.

The hybrids between lines from groups G1 × G2, G1 × G3 and G2 × G4 presented higher yield compared to the other groups. The largest group was G1 × G3, which included 31 hybrids with grain yield ranging from 2.4 to 6.1 t ha⁻¹, with an average of 4.37 t ha⁻¹. The heterosis in G1 × G3 ranged from -17 % to 35 %, with an average of 4 %. Heterobeltiosis of these hybrids ranged from -28 % to 29 %, with a mean of 0.5 %. The hybrid with the highest grain yield (6.1 t ha⁻¹) presented heterosis of 24 % and the hybrid with the lowest yield (2.4 t ha⁻¹) presented heterosis of -21 %.
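The grouping of hybrids by the PCA groups of their parents can be sketched as below. The line names, group assignments and yields are hypothetical placeholders, not data from this study, and the function name is an assumption.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

Hybrid = Tuple[str, str, float]  # (female line, male line, grain yield in t/ha)


def summarize_by_group_pair(hybrids: List[Hybrid],
                            line_group: Dict[str, str]) -> Dict[str, Dict[str, float]]:
    """Summarize grain yield for each combination of parental groups."""
    yields_by_pair: Dict[str, List[float]] = defaultdict(list)
    for female, male, grain_yield in hybrids:
        # Sort the two group labels so that G3 x G1 and G1 x G3 share one key.
        pair = " x ".join(sorted((line_group[female], line_group[male])))
        yields_by_pair[pair].append(grain_yield)
    return {pair: {"n": len(v), "min": min(v), "max": max(v), "mean": mean(v)}
            for pair, v in yields_by_pair.items()}


# Hypothetical lines and yields, for illustration only.
example_groups = {"B_a": "G3", "B_b": "G4", "R_a": "G1", "R_b": "G3"}
example_hybrids = [("B_a", "R_a", 4.0), ("B_a", "R_b", 5.0),
                   ("B_b", "R_a", 3.0), ("B_a", "R_a", 6.0)]
example_summary = summarize_by_group_pair(example_hybrids, example_groups)
```

Sorting the group labels treats a cross as the same combination regardless of which parent is the female, which matches the way the group combinations are named in the text.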

Discussion
The genetic characterization of elite germplasm with SNP markers provides important information for defining breeding strategies and identifying superior complementary lines. In this context, we applied SNP markers to study sorghum elite lines using principal component, population structure, and clustering analyses.
The polymorphism information content (PIC) estimates the discriminatory power of a marker by taking into account the number of alleles at a locus and their relative frequencies in the population. PIC depends on the type of marker and on the population. The SNP markers used for the elite grain sorghum lines in our study were informative: the mean PIC among the 160 lines was 0.24, similar to the average of 0.20 obtained from 1,841 SNPs in 208 diverse sorghum accessions by Bekele et al. (2013). Our study used elite lines that had undergone selection cycles, which narrowed their genetic variability and changed the allelic frequencies (Takano-Kai et al., 2009). B-lines presented lower PIC than R-lines, which can be attributed to the smaller number of B-lines in the study. Sorghum breeding programs normally have a limited number of B-lines, which are used to maintain the male sterility of A-lines. The development of new B-lines is a lengthy process, since male sterility must be established before crosses can be made (Rooney, 2007; Jordan et al., 2010).
The population structure analysis divided the lines into two subgroups (Figure 1A), which was consistent with the classification of the lines into restorer (R) and maintainer (B) (Figure 1B). A similar result was reported by Mindaye et al. (2015) working with Ethiopian sorghum. PCA and Neighbor-Joining (NJ) analyses are broadly used in genetic diversity studies and their results reflect family relationships more specifically (Price et al., 2010). The PCA divided the sorghum lines into four groups, and the NJ analysis was highly consistent with their pedigree information. The inbred lines clustered in the same group shared similar pedigrees, and only a few inbred lines shared pedigree with another group. This clustering corroborated the population structure and is useful for dividing the lines into heterotic groups, directing the crosses to be performed, and allowing for exploitation of heterosis by crossing inbred lines belonging to different genetic groups.
The G1 group consisted mainly of R-lines (51) but also included four B-lines with pedigree derived from BRP5BR, SC748 or SC326-6. Two of these B-lines were allocated together with other B-lines from the G3 and G4 groups in the dendrogram. Line 101B was phylogenetically included in G4 in agreement with its B genome and pedigree origin from ATF54 and Tx623, indicating a more adequate allocation based on NJ analysis.
BRP5BR is a population created from a random mating of eleven lines selected for aluminum tolerance. The other two lines were used to introduce resistance to anthracnose in the population. Most of these lines are Caudatum and Guinea races originating from Brazil and Africa.
The G2 group comprised 37 R-lines and 2 B-lines, most of them derived from BRP5BR, BRP3BR, SC326, Tx2536 or SC170, comprising Caudatum, Guinea, Durra and Bicolor races, originating from Africa. Line 130B (ATF54), grouped in G2 by PCA, was allocated close to other B-lines in the dendrogram, which maintained the differentiation between B- and R-lines.
The G3 group was composed of 19 R-lines and 16 B-lines, sharing a more diverse pedigree and being more dispersed in the dendrogram. Some of the lines in this group were of the Caudatum race, derived from SC748, SC326, and SC170, introduced from the USA or Africa. As a result of this wide diversity, the hybrids generated by crosses of these lines presented good yield performance. The G4 group comprised 29 B-lines and 2 R-lines, mostly derived from ATF54 and ARG1. The clustering of the G1 and G4 groups was consistent with the restoration of fertility, the former consisting predominantly of R-lines and the latter predominantly of B-lines.

Although genetic diversity based on molecular data has been proposed as having positive correlation with heterosis of F1 hybrids, a strong association has rarely been observed between hybrid yield and genetic distance between their parents (Jordan et al., 2003; Amelework et al., 2017). In our study, the hybrid performance was correlated with genetic distance (r = 0.32), which was comparable to the correlation observed by Jordan et al. (2003) (r = 0.42), using 162 F1 sorghum hybrids derived from 70 lines. These results demonstrate a positive correlation between the genetic distance of the parental lines and both heterosis and grain yield. However, these correlations are dependent on the genetic background of the parental lines. The hybrids derived from the crosses of lines belonging to the G1 and G3 clusters, based on PCA, presented higher grain yield, heterosis and heterobeltiosis (Figure 5). However, this group had only a few representatives. The second group of high-yielding hybrids was between lines of the G1 × G2 cluster and the G1 × G3 group. The latter had 31 hybrids and the former only four. Therefore, the cluster G1 × G3 was more representative and could be used to develop new hybrids.

The genetic variability of breeding lines within pre-defined heterotic groups can be better explored by high throughput genotyping, which can also be applied to classify new lines into heterotic groups, in order to effectively contribute to the development of high-yielding sorghum hybrids. A better understanding of genetic diversity in sorghum will enhance the use of lines, guide ongoing efforts in sorghum and accelerate breeding.

Conclusions
The molecular marker data reveal the existence of genetic divergence between the groups of maintainer (B) and restorer (R) lines.
The R-lines showed greater genetic diversity than the B-lines, which is explained by the smaller number of maintainers in the program.
The PCA and Neighbor-Joining analyses classified the breeding lines in a highly concordant manner, and can be used to determine heterotic groups in sorghum.
The genetic distances are correlated to heterosis, supporting the selection of more contrasting lines when developing sorghum hybrids.