Development of SNP markers for grain yield screening of Brazilian rice cultivars

The objective of this work was to identify and validate single-nucleotide polymorphism (SNP) markers related to grain yield in a rice (Oryza sativa) core collection. A genome-wide association study (GWAS) was carried out on 541 rice accessions genotyped with 167,470 SNPs. The grain yield of these accessions was estimated through the joint analysis of nine field experiments carried out in six Brazilian states. Fifteen SNPs were significantly associated with grain yield, and out of the ten SNPs converted to TaqMan assays, four discriminated the most productive accessions. These markers were used for the screening of rice accessions with favorable alleles. The selected accessions were then evaluated in field experiments in target environments, in order to select the most productive ones. This screening reduces the number of accessions evaluated experimentally, making it possible to prioritize those with higher productive potential, which allows an increase in the number of replicates and, consequently, in the experimental accuracy.


Introduction
There is a consensus on the need to increase food production to meet the demand of the growing world population. A practical way to achieve this goal is to use molecular markers to make the development of lines with a higher grain yield more efficient (Gupta et al., 2019). Single-nucleotide polymorphism (SNP) markers have been used in plant genomics due to their wide genome distribution and low scoring cost (Voss-Fels et al., 2019). SNP markers can be used to genotype rice by medium-density (6K) or high-density (60K) chips (Tao et al., 2019), sequencing (Elshire et al., 2011), and resequencing (Barabaschi et al., 2016). The combination of high-throughput DNA sequencing and phenotyping of rice (Oryza sativa L.) cultivars can identify genes related to quantitative traits, through quantitative trait loci (QTL) analysis and genome-wide association studies (GWAS) (Mochida et al., 2018). Huang et al. (2010) and Lu et al. (2015) used GWAS to identify hundreds of SNP markers related to important agronomic traits in rice. Currently, the difficulty is not only to identify potentially useful molecular markers, but also to convert them into a useful and reproducible tool for breeding programs.
The greatest impact of marker-assisted selection has been on the incorporation of resistance to rice blast disease, for which, out of the more than 100 resistance genes identified, 28 have been cloned and validated. The pyramiding of these genes has led to the development of cultivars with more durable resistance to rice blast (Wu et al., 2019). However, for traits with more complex inheritance, such as grain yield, the correlation between a particular marker and a QTL is weak, hindering the improvement of the genotype of interest. Even when identified and validated, this marker-QTL correlation has been ineffective beyond the original study population, across different environments and years of experimentation. Begum et al. (2015), through the GWAS analysis of a rice breeding population, found three SNP markers related to a high-yield haplotype that explained approximately 9% of the phenotypic variation, while Zhang et al. (2017) described a SNP that showed high heritability for productivity. According to Cobb et al. (2019), a QTL may explain a high percentage of the phenotypic variation, but may have little biological significance. Moreover, the effects of the markers associated with productivity are individually small and, as a whole, they do not account for most of the phenotypic variation. This makes marker-assisted selection difficult, as genotyping will not be a good predictor of grain yield. Platten et al. (2019) observed that markers associated with traits of interest are rarely validated, that is, the results generally do not reach the marker-assisted breeding routine.
Therefore, it would be valuable to have an alternative strategy that could be effectively incorporated into the routine of assisted selection, in order to increase the chance of success of this process, beyond the identification of poorly reproducible associations between favorable markers and phenotypes in a given study population.
The objective of this work was to identify and validate SNP markers related to grain yield in a rice core collection.

Materials and Methods
The GWAS panel consisted of the 550 accessions of Embrapa's rice core collection (Abadie et al., 2005), and included 94 Brazilian lines and cultivars (57 upland and 37 lowland accessions), 148 international lines and cultivars (76 upland and 72 lowland accessions), and 308 Brazilian landraces (148 upland and 77 lowland accessions, and 83 accessions for both cropping systems). The 550 rice accessions and four checks ('BR/IRGA 409', 'Metica 1', 'BRS Caiapó', and 'BRS Colosso') were evaluated in nine experiments, in six Brazilian states (Table 1). The experiments followed Federer's augmented block design, with 23 blocks. Each plot consisted of three 4 m rows, sown at a density of 100 seeds per meter. The statistical analysis of grain yield data (kg ha⁻¹) was performed using the lme4 package of the R platform (R Core Team, 2018). In the joint analysis, blocks and genotypes (except for checks) were treated as random effects, in addition to the experimental error. The estimates of variance components were obtained by residual maximum likelihood (REML). The genetic values for grain yield of each accession were estimated by BLUP (best linear unbiased prediction).
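The idea behind a BLUP of a genotype effect can be illustrated with a minimal sketch. It is not the lme4 model fitted in this work: it assumes a single random genotype effect, variance components that are already known, and the plain average of all plots as the overall mean. The genotype names and all numbers are hypothetical.

```python
from statistics import mean


def blup_genotype_effects(yields_by_genotype, var_g, var_e):
    """Shrink each genotype's mean deviation toward zero.

    yields_by_genotype: dict mapping genotype name -> list of plot yields (kg/ha)
    var_g, var_e: genetic and residual variance components, assumed to be
    already estimated (e.g. by REML); the values passed below are hypothetical.
    """
    all_plots = [y for plots in yields_by_genotype.values() for y in plots]
    grand_mean = mean(all_plots)  # simplification: plain average of all plots
    effects = {}
    for name, plots in yields_by_genotype.items():
        n = len(plots)
        # More plots or a larger genetic variance means less shrinkage.
        shrinkage = var_g / (var_g + var_e / n)
        effects[name] = shrinkage * (mean(plots) - grand_mean)
    return effects


# Hypothetical data: two genotypes with two plots each.
example = {"A": [3000.0, 3200.0], "B": [2000.0, 2200.0]}
effects = blup_genotype_effects(example, var_g=100.0, var_e=200.0)
```

With these hypothetical variance components the shrinkage factor is 0.5, so each predicted genotype effect is half of the raw deviation from the overall mean.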
Pesq. agropec. bras., Brasília, v.55, e01643, 2020 DOI: 10.1590/S1678-3921.pab2020.v55.01643
The genomic DNA of the 550 accessions was obtained from young leaves using the DNeasy 96 Plant Kit (Qiagen, Germantown, MD, USA). The SNP markers were obtained by genotyping by sequencing (GBS), a method proposed by Elshire et al. (2011). To determine the marker quality, the parameters "reproducibility" (percentage of technical replicate pairs scoring identically for a given marker) and "call rate" (percentage of samples for which a marker was scored) were used. For the genetic analysis, only SNPs meeting the thresholds of 0.01 for minor allele frequency (MAF), 0.9 for inbreeding coefficient, and 0.1 for minimum locus coverage were considered. Data imputation was performed with the fastPHASE 1.3 software (Scheet & Stephens, 2006). The imputation accuracy was estimated by the concordance rate (proportion of correctly imputed genotypes), in which 10% of the genotypes were randomly masked, then imputed and compared with the true results. Population structure was estimated using the Bayesian Markov chain Monte Carlo (MCMC) model implemented in the Structure program (Pritchard et al., 2000). Five runs were performed for each number of populations (K) tested, from 1 to 10. The burn-in length and the number of MCMC replicates were set at 50,000 and 100,000, respectively. The K value was determined by the log likelihood of the data [LnP(D)] and by delta K, based on the rate of change of LnP(D) between successive values of K. These analyses were performed using the Structure Harvester program (Earl & VonHoldt, 2011). The population structure data and the relationship matrix (K matrix or kinship) were obtained with the Tassel 4.0 software. From the population structure data and the kinship matrix, GWAS was performed with a mixed model, which corrects for spurious associations that could occur due to the genetic similarity between accessions.
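The masking-based accuracy check described above (hide 10% of the genotypes, fill them in again, and compare with the true calls) can be sketched as follows. This is only an illustration: the imputer is a deliberately simple per-marker mode fill standing in for fastPHASE, and the function names and the genotype coding (0, 1, 2, with None for a missing call) are assumptions.

```python
import random
from collections import Counter

MISSING = None


def mask_genotypes(matrix, fraction, rng):
    """Copy the matrix (rows = accessions, columns = markers) and hide a
    fraction of the known calls. Returns the copy and the masked cells."""
    known = [(i, j) for i, row in enumerate(matrix)
             for j, call in enumerate(row) if call is not MISSING]
    masked = rng.sample(known, round(fraction * len(known)))
    copy = [list(row) for row in matrix]
    for i, j in masked:
        copy[i][j] = MISSING
    return copy, masked


def impute_by_mode(matrix):
    """Placeholder imputer (not fastPHASE): fill each missing call with the
    most common genotype observed at that marker."""
    filled = [list(row) for row in matrix]
    for j in range(len(matrix[0])):
        calls = [row[j] for row in matrix if row[j] is not MISSING]
        mode = Counter(calls).most_common(1)[0][0] if calls else MISSING
        for row in filled:
            if row[j] is MISSING:
                row[j] = mode
    return filled


def concordance_rate(truth, imputed, masked):
    """Proportion of masked cells whose imputed call equals the true call."""
    hits = sum(truth[i][j] == imputed[i][j] for i, j in masked)
    return hits / len(masked)


# Hypothetical panel: 20 accessions x 4 markers.
rng = random.Random(1)
truth = [[rng.choice([0, 2]) for _ in range(4)] for _ in range(20)]
hidden, masked_cells = mask_genotypes(truth, 0.10, rng)
rate = concordance_rate(truth, impute_by_mode(hidden), masked_cells)
```

Because only cells with a known true genotype are masked, the rate measures how well the imputer recovers calls that were really observed.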
The SNP markers identified as significantly associated with grain yield and the population structure data were considered as fixed-effect factors, while the kinship matrix was considered as a random-effect factor. For better analysis reliability, rare alleles were removed by filtering the imputed SNP data with a minimum minor allele frequency (MAF) of 0.05. The GWAS analysis used the GAPIT package (Lipka et al., 2012); the stepwise regression analysis was performed in the GCTA software (Yang et al., 2011), and the removal of SNPs with overlapping effects was carried out in the R software, according to the methodology described in Pantalião et al. (2016). The selected SNPs were positioned in haplotypic blocks using the software Haploview (Barrett et al., 2005), which allowed the identification of the candidate genes that cosegregated with the SNPs identified by GWAS as associated with the grain yield trait. Subsequently, the transcribed sequences of these genes were obtained to search for their putative functions in the Rice Genome Annotation Project (Kawahara et al., 2013).
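The allele-frequency filter mentioned above can be sketched in a few lines. The sketch assumes biallelic SNPs coded as the number of copies of the alternate allele (0, 1, or 2, with None for a missing call), and it keeps markers whose MAF is at least the threshold. The marker identifiers are hypothetical.

```python
def minor_allele_frequency(genotypes):
    """MAF of one biallelic SNP; genotypes are alternate-allele counts
    (0, 1, 2) and None marks a missing call."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)


def filter_by_maf(snps, threshold=0.05):
    """Keep the SNPs whose MAF is at least `threshold`.

    snps: dict mapping marker id -> list of genotype codes.
    """
    return {marker: calls for marker, calls in snps.items()
            if minor_allele_frequency(calls) >= threshold}


# Hypothetical markers: one common variant and one rare variant.
panel = {
    "snp_common": [0, 0, 2, 2],
    "snp_rare": [0] * 99 + [2],
}
kept = filter_by_maf(panel)  # only "snp_common" passes the 0.05 threshold
```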
A subset of the grain-yield-associated SNPs was selected for validation using TaqMan (Thermo Fisher Scientific, San Diego, CA, USA) probe-based chemistry designed for genotyping. The target SNPs were aligned with the reference genome (Os-Nipponbare-Reference-IRGSP-1.0-release 7), and a flanking region within 250 bp up- and downstream from the target SNPs was selected (Woodward, 2014). Before the design of allele-specific molecular markers, the resulting 501 bp DNA fragment was evaluated for the presence of repetitive sequences, using the program RepeatMasker (Smith et al., 2019), and for the presence of nontarget SNPs, using the SNPseek program (Mansueto et al., 2017) derived from the 3,000 Rice Genomes Project (Alexandrov et al., 2015). Those sequences without repetitive elements and containing only the target SNPs were used for primer design. Two sets of plant material were used for the SNP validation analysis, as follows: 27 inbred lines from Value of Cultivation and Use (VCU) experiments (17 upland and 10 lowland rice inbred lines) from Embrapa's rice breeding program (Table 2), with average grain-yield data from 15 field experiments; and 20 high-yielding and 20 low-yielding rice cultivars from the joint analysis involving the nine experiments of the rice core collection (Table 3). PCR reactions, in duplicate, were carried out in a 5 µL final volume with the Custom TaqMan SNP Genotyping Assays (40X) and the TaqMan GTXpress Master Mix (2X) (Thermo Fisher Scientific, USA). The reactions were

Table 2. Rice (Oryza sativa) inbred lines and checks of Embrapa's value of cultivation and use (VCU) experiments genotyped by 10 TaqMan assays.

Results and Discussion
There was a great variation for grain yield in all nine field experiments, as expected given the high diversity of the accessions of the core collection and the environmental variation (Table 1). The average grain yield of the experiments varied from 1,108.4 kg ha⁻¹ (Vilhena, RO) to 6,272.4 kg ha⁻¹ (Pelotas, RS). The genotyping of the 550 accessions provided 526,220 SNPs, distributed on the 12 chromosomes. After data imputation, accessions showing more than 20% of missing data were excluded, which resulted in 445,589 SNPs from 541 accessions. When the 0.05 minimum allele frequency was applied, the final number was 167,470 SNPs (Table 4). The average distribution was approximately 449 SNPs/Mbp (one SNP every 2.23 kbp), ranging from 366 SNPs/Mbp on chromosome 5 to 507 SNPs/Mbp on chromosome 11. Chromosome 1 had the highest number of SNPs (21,662), while chromosome 9 had the lowest (9,788), with an average of 13,956 SNPs per chromosome. Considering the high linkage disequilibrium of the rice genome, about 150 kbp, this average marker density is considered adequate to perform genome-wide association analyses (Rebolledo et al., 2015). The population structure analysis identified two groups of accessions (K=2), corresponding to the Oryza sativa subspecies indica and japonica. The GWAS was performed with 167,470 SNPs, and identified 31 SNP markers significantly associated with the grain yield trait. After the stepwise regression analysis, 15 SNPs remained in the model, and the R² values (p<0.001) did not differ significantly between the models (0.496 and 0.490 for 31 and 15 SNPs, respectively). Out of the 15 SNPs, 9 were located in genes, 3 were located in genes present in linkage blocks, and 3 were located in intergenic regions (Table 5). The validation of the SNPs associated with traits identified by GWAS is an essential step to enable the effective use of these markers in breeding programs (Kikuchi et al., 2017).
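The marker-density figures reported above can be checked with simple arithmetic. The sketch below uses only numbers given in the text (167,470 SNPs, 12 chromosomes, about 449 SNPs/Mbp, and a linkage disequilibrium extent of about 150 kbp); the variable names are illustrative, and the number of SNPs per LD window is a derived, approximate value.

```python
TOTAL_SNPS = 167_470        # SNPs retained after the 0.05 frequency filter
N_CHROMOSOMES = 12
DENSITY_PER_MBP = 449       # approximate average density reported in the text
LD_EXTENT_KBP = 150         # approximate LD extent reported in the text

# Average number of SNPs per chromosome.
mean_snps_per_chromosome = TOTAL_SNPS / N_CHROMOSOMES

# Average spacing between neighbouring SNPs, in kbp.
kbp_between_snps = 1_000 / DENSITY_PER_MBP

# Genome length implied by the SNP count and the density, in Mbp.
implied_genome_mbp = TOTAL_SNPS / DENSITY_PER_MBP

# Average number of SNPs expected inside one LD window.
snps_per_ld_window = DENSITY_PER_MBP * LD_EXTENT_KBP / 1_000

print(round(mean_snps_per_chromosome))  # 13956
print(round(kbp_between_snps, 2))       # 2.23
print(round(implied_genome_mbp))        # 373
print(round(snps_per_ld_window))        # 67
```

With dozens of markers per 150 kbp window on average, each linkage block is expected to be tagged by many SNPs, which is consistent with the statement that this density is adequate for GWAS.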
From the 15 SNPs maintained in the statistical model, 10 were used in the development of TaqMan probes. The remaining 5 SNPs showed a high percentage of repetitive DNA in the adjacent sequences; thus, it was not possible to design TaqMan assays for them (S1_33418191, S9_1062037, S10_2231343, S12_3544726, and S12_17681142). Considering the 10 SNP loci from which the TaqMan assays were derived, 28 genes were identified (25 of them located in linkage blocks), 20 of which were putatively annotated and related to metabolic processes, such as responses to biotic and abiotic stresses, responses to endogenous and exogenous stimuli, post-embryonic multicellular development, growth, and morphogenesis (Table 6). Marker #7 is located in a gene (OsWAK, LOC_Os10g01410) that was previously related to grain yield and panicle number traits (Zhang et al., 2017). The SNPs located in genes are candidates to be explored by genetic engineering (Chen et al., 2018). The rice inbred lines genotyped by TaqMan assays showed the following results: 7 markers were monomorphic, and 3 markers discriminated genotypes of lowland and upland cropping systems (S7_939762, #4; S10_251060, #7; S2_22142097, #15) (Tables 2 and 3). In Brazil, most upland rice cultivars are of the japonica subspecies, and lowland cultivars are of the indica subspecies (Khush, 1997); therefore, these three markers can help identify the cropping system of materials for which it is unknown. The alleles exclusive to indica or japonica may have originated during the independent domestication of these two subspecies (Civan & Brown, 2018; Wang et al., 2018). The markers #6 (S9_20925193), #8 (S4_21506318), and #11 (S9_7799399) were unable to discriminate the most productive accessions. The accessions with average grain yield above 2,981 kg ha⁻¹ were all discriminated by the markers #1 (S1_23079331), #2 (S2_26805540), #3 (S6_5353837), and #5 (S9_12051077) (Table 3).
The exception was the landrace 'Santa Catarina', whose average productivity was 3,052 kg ha⁻¹, but which showed, for markers #1, #2, and #5, the SNP pattern of plants of lower productivity. A possible explanation is that these four loci were selected independently during the domestication and breeding of cultivated rice, as observed by Xie et al. (2015). According to the selective sweep model, a series of variants in the genome that lead to adaptation is rapidly fixed in a population, and this creates a selection signature that consists of reduced genetic diversity and extended linkage disequilibrium in the genome region around the locus under selection (Gentzbittel et al., 2019). Due to the low heritability of the productivity trait, and the inconsistency of markers associated with traits of interest in different genetic backgrounds or in different environments, an alternative strategy needs to be implemented to make these markers useful for breeding programs.
Our suggestion is to use the four markers to select rice accessions from a gene bank, and then evaluate these accessions in field experiments to identify the most productive ones. This screening would reduce the number of accessions to be evaluated in the field, making it possible to prioritize efforts on those with greater grain-yield potential. The genotypes with the greatest adaptability to a given location would be selected and then used as parents for the development of inbred lines targeted at specific locations.

Conclusions
1. The four SNP markers associated with grain yield, identified in this work, can select Brazilian cultivars of rice with a greater productive potential.