Genome-wide association for mapping QTLs linked to protein and oil contents in soybean

The objective of this work was to identify single-nucleotide polymorphism (SNP) markers linked with quantitative trait loci (QTLs) associated with increased contents of protein and oil in soybean. A total of 169 Brazilian soybean varieties, genotyped with 6,000 SNP markers, were evaluated. Protein and oil contents were obtained with the near-infrared reflectance method. Correlation and multiple linear regression analyses were used to identify linkage disequilibrium between SNP markers and the QTLs associated with the two characteristics. Seven QTLs were found to be associated with protein content, on six chromosomes (2, 6, 11, 12, 13, and 16), explaining 60.9% of the variation in this trait. For oil content, eight QTLs were identified on six chromosomes (1, 4, 5, 6, 17, and 19), explaining 78.3% of the variation in the trait. The correlation between the number of loci containing favorable alleles and the evaluated characteristics was 0.49 for protein content and 0.60 for oil content. The molecular markers identified are mapped in genomic regions containing QTLs previously mapped for both characteristics, which reinforces the association between these regions and the genetic control of oil and protein contents in soybean.


Introduction
Soybean (Glycine max L.) is the world's main source of plant protein and oil, providing approximately 20 to 24% of the oil and fat consumed worldwide (Cavalcante et al., 2009). This legume also has the greatest concentration of protein of all food crops. The soybean production technology used in Brazil is among the best in the world, as are Brazilian soybean yields (Masuda & Goldsmith, 2009). Commercial soybean varieties in the country contain about 40% protein and 20% oil, which vary according to genetic and environmental factors (Soares et al., 2004; Moraes et al., 2006).
Quantitative trait locus (QTL) mapping strategies that use populations derived from two-parent crosses reveal only a small fraction of all possible alleles in a target species. Therefore, molecular markers associated with QTLs can generally be used only in the populations for which the markers were specifically developed (Cahill & Shmidt, 2004; Holland, 2004; Schuster, 2011). QTL mapping using linkage disequilibrium infers the association between genotypes (or haplotypes) and phenotypes by evaluating the genetic polymorphism generated in different genetic backgrounds, throughout many recombination generations (Dekkers & Hospital, 2002; Nordborg & Tavaré, 2002).
Association mapping detects and locates QTLs based on the correlation intensity between molecular markers and phenotypic characteristics. Linkage disequilibrium between two loci is a function both of the time (number of generations) passed since the recombination generations began and of the recombination frequency between the loci. After many recombination generations in an unstructured population, only the correlations between QTLs and markers closely linked should remain, facilitating a more precise mapping (Mackay & Powell, 2007). For linkage disequilibrium mapping, however, there is no need to prepare a mapping population, and the entire genome is evaluated to identify regions associated with a particular phenotype. The greater the association between marker alleles and phenotype variants, the greater the probability that the phenotype is physically linked to the marker (Hwang et al., 2014).
The objective of this work was to identify SNP markers linked with QTLs associated with increased contents of protein and oil in soybean.

Materials and Methods
The experiment was carried out at the facilities of Cooperativa Central de Pesquisa Agrícola, located in the municipality of Cascavel, in the state of Paraná, Brazil. A total of 169 Brazilian soybean varieties were evaluated in the 2011/2012 crop year, and the field trial was performed in a 13x13 lattice design. The plots contained four 5.0-m lines, where the agronomic characteristics of the varieties were assessed. At harvest, a sample of grains was taken from each variety for the evaluation of protein and oil contents.
Approximately 20 g of grains from each sample were ground in a cyclone-type grinder to obtain a uniform, fine powder. Protein, oil, and moisture contents were determined using the previously calibrated Instalab 600 near-infrared reflectance (NIR) device (Dickey-John, Auburn, IL, USA). The data were converted to express the contents of oil and protein on a dry matter basis.
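The text does not give the formula used for the dry matter conversion. The sketch below assumes the usual conversion, in which a content measured on an as-is basis is divided by the dry matter fraction of the sample. The function name and the example values are hypothetical and are not taken from the study.

```python
def to_dry_matter_basis(content_pct: float, moisture_pct: float) -> float:
    """Convert a content (%) measured on an as-is basis to a dry matter basis.

    Assumed formula (not stated in the paper):
        content_dm = content_as_is * 100 / (100 - moisture)
    """
    if not 0.0 <= moisture_pct < 100.0:
        raise ValueError("moisture_pct must be in the interval [0, 100)")
    return content_pct * 100.0 / (100.0 - moisture_pct)


# Hypothetical sample: 36% protein measured at 10% moisture.
protein_dm = to_dry_matter_basis(36.0, 10.0)
```

Under this assumption, a sample with 36% protein at 10% moisture corresponds to 40% protein on a dry matter basis.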
A sample of 100 seeds from each variety was ground, and 50 mg of homogenized powder were used for DNA extraction, as described by Schuster et al. (2004).
The genotyping with SNP markers was carried out at Deoxi Biotecnologia Ltda, located in the municipality of Araçatuba, in the state of São Paulo, Brazil, using the iScan platform and the 6k Infinium iSelect HD Custom Genotyping BeadChip panel (Illumina, Inc., San Diego, CA, USA), customized for soybean. The process followed the instructions of the manufacturer (Illumina, 2014). Markers with more than 10% lost data or minor allele frequency (MAF) less than 5% were removed from the analysis.
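The two quality filters can be illustrated with a minimal sketch. It assumes that each marker is stored as a list of scores, 0 and 2 for the two homozygous genotypes and None for lost data, and that MAF is computed only from the called genotypes. The function name and the data layout are hypothetical; the thresholds are the ones stated above.

```python
from typing import Optional, Sequence


def passes_quality_filters(
    genotypes: Sequence[Optional[int]],
    max_missing: float = 0.10,
    min_maf: float = 0.05,
) -> bool:
    """Return True if a marker passes the lost-data and MAF thresholds.

    genotypes: one score per variety, 0 or 2 for the two homozygous
    genotypes and None for lost data (assumed coding).
    """
    n_total = len(genotypes)
    called = [g for g in genotypes if g is not None]
    if n_total == 0 or not called:
        return False

    # Fraction of varieties without a usable genotype call.
    missing_rate = 1.0 - len(called) / n_total
    if missing_rate > max_missing:
        return False

    # With 0/2 coding, the frequency of the alternate allele is mean / 2.
    alt_freq = sum(called) / (2.0 * len(called))
    maf = min(alt_freq, 1.0 - alt_freq)
    return maf >= min_maf
```

A marker is kept only when both conditions hold, which mirrors the rule of removing markers with more than 10% lost data or MAF below 5%.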
The association between markers and phenotypes was assessed by correlation and multiple regression analyses. The correlation analysis was carried out using the Pearson correlation coefficient:

$$r = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{\left[n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2\right]\left[n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2\right]}}$$

where $x_i$ is the score of the genotype with marker $x$ for individual $i$; $y_i$ is the phenotype of characteristic $y$ in individual $i$; and $n$ is the number of samples. The markers received a score of 0 for one homozygous genotype and 2 for the alternate homozygous genotype. The few heterozygous genotypes were treated as lost data. The square of the correlation value ($r^2$), weighted by $n - 1$, has a chi-squared distribution with one degree of freedom: $\chi^2 = (n - 1)r^2$, where $n$ is the number of individuals and $r$ is the correlation between the marker and the phenotype.
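The single-marker test described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' spreadsheet: the function name is hypothetical, lost data are represented by None, and the p-value uses the identity that the upper tail of a chi-squared distribution with one degree of freedom equals erfc(sqrt(chi2 / 2)).

```python
import math
from typing import Optional, Sequence, Tuple


def marker_trait_association(
    scores: Sequence[Optional[int]],
    phenotypes: Sequence[float],
) -> Tuple[float, float, float]:
    """Return (r, chi2, p) for one marker, with chi2 = (n - 1) * r**2."""
    # Drop varieties whose genotype is lost (None).
    pairs = [(x, y) for x, y in zip(scores, phenotypes) if x is not None]
    n = len(pairs)
    if n < 3:
        raise ValueError("at least three complete observations are required")

    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("marker or phenotype has no variation")

    r = sxy / math.sqrt(sxx * syy)
    chi2 = (n - 1) * r * r
    # Upper-tail probability of a chi-squared variable with 1 degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return r, chi2, p


# Hypothetical toy data: four varieties, scores 0/2 and a trait value.
r, chi2, p = marker_trait_association([0, 0, 2, 2], [1.0, 2.0, 3.0, 4.0])
```

For a Manhattan plot, each p-value would then be expressed as -log10(p).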
Correlation significance was determined using a chi-squared distribution, expressed as -log10(p), where p is the probability value. The significance levels were corrected using the false discovery rate (FDR) method (Benjamini & Hochberg, 1995). The correlation analyses and the FDR correction were carried out in an Excel sheet. Markers with p-values lower than 0.0001 after the correction with the FDR method were considered associated with the studied phenotype.
Pesq. agropec. bras., Brasília, v.52, n.10, p.896-904, out. 2017. DOI: 10.1590/S0100-204X2017001000009
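The FDR correction can be illustrated with a minimal sketch of the Benjamini-Hochberg step-up adjustment. It does not reproduce the authors' Excel sheet; the function name and the example p-values are hypothetical.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values for four markers.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [q < 0.0001 for q in adjusted]
```

Each adjusted value is then compared with the chosen threshold, 0.0001 in this study, to decide which markers are kept.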
The JMP software (SAS Institute Inc., Cary, NC, USA) was used for the multiple regression analysis. The stepwise procedure was adopted for variable selection, with a probability of 5% for both the entry and the removal of variables. For the multiple regression analysis, only markers that were significant in the correlation analysis (p<0.0001) were used.
In addition, a correlation analysis was performed between the number of loci carrying favorable alleles at the selected markers and the protein and oil contents of each soybean variety.
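This last analysis can be sketched in two steps: count, for each variety, the loci that carry the favorable allele, and then correlate that count with the trait. The sketch assumes that genotypes are coded "A" (favorable), "B" (unfavorable), or None (lost data). The function names and the toy data are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def count_favorable(genotype_row: Sequence[Optional[str]], favorable: str = "A") -> int:
    """Number of loci in one variety that carry the favorable allele."""
    return sum(1 for g in genotype_row if g == favorable)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical toy data: four varieties genotyped at three selected markers.
genotypes: List[List[Optional[str]]] = [
    ["A", "A", "B"],
    ["A", "B", "B"],
    ["B", "B", None],
    ["A", "A", "A"],
]
protein = [42.0, 41.0, 40.0, 43.0]

counts = [count_favorable(row) for row in genotypes]
r_counts = pearson(counts, protein)
```

In this toy example, lost data are simply not counted as favorable. How missing genotypes were handled in the original count is not stated in the text.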

Results and Discussion
The contents of the 169 varieties, based on the dry matter of the grains, varied from 37.2 to 48.3% for protein and from 18.2 to 27.5% for oil (Figure 1). The most frequent values, however, were between 42 and 43% for protein and between 23 and 24% for oil.
After filtering the markers for MAF higher than 5% and lost data lower than 10%, 4,962 SNPs were used in the genome-wide association analysis. In the correlation analysis, a significant association was observed for markers on all of the 20 chromosomes, for protein content, and on all chromosomes, except 7 and 16, for oil content (Figure 2). In the soybean consensus map (USDA, 2016), 125 QTLs have been mapped for protein and 148 QTLs for oil, and all of the 20 chromosomes contain QTLs for oil and protein contents.
In the multiple regression analysis, seven significant markers were associated with protein contents, on the six following chromosomes: 2, 6, 11, 12, 13, and 16 (Table 1). These markers explained 60.93% of the variation in the protein content in the evaluated set of varieties. Redundant markers, associated with the same QTL, were eliminated from the model, leaving just one marker associated with each QTL. Markers linked to QTLs with minor effects were also eliminated. Therefore, although 95 markers were significant in the correlation analysis, on all 20 chromosomes, only 7 markers, on 6 chromosomes, explained more than 60% of the variation in protein content. The regression coefficient values of these seven markers varied from 0.41 to 1.45, meaning that the substitution of one allele in the SNP markers associated with protein resulted in an increase in protein content between 0.41 and 1.45%.
On chromosome 6, there are two significant SNPs according to the multiple regression model. These SNPs are separated by more than 7.8 Mb on the soybean genome and are in linkage equilibrium. This means that each marker was associated with a different QTL. Jun et al. (2008) identified 11 QTLs for protein content in soybean using the association analysis. Csanádi et al. (2001) reported QTLs for protein content on chromosomes 1, 6, 7, and 9. In Brazil, Soares et al. (2008) detected QTLs on chromosomes 3, 6, 15, 18, and 19, explaining from 38.84 to 55.53% of the variation in protein content in soybean, depending on the cultivation site. QTLs for protein content in soybean have also been found on chromosome 20 (Chung et al., 2003; Nichols et al., 2006), chromosome 18 (Panthee   Leamy et al., 2017), and chromosome 14 (Zhang et al., 2004; Leamy et al., 2017).
In the region of chromosome 13 where marker Gm13_33637077_T_C was mapped, three QTLs were associated with protein in the soybean consensus map: Seed Protein 6-1, Seed Protein 24-2, and Seed Protein 26-11. The Seed Protein 5-2 QTL is mapped on the same region of chromosome 12 where marker Gm12_35525603_A_G is mapped. Marker Gm11_5536036_C_A is mapped on chromosome 11, in the same region as the Seed Protein 3-2 and Seed Protein 34-7 QTLs. On chromosome 6, marker Gm06_12914255_A_G is mapped in the same region as the Seed Protein 34-2 QTL, while marker Gm06_49106015_C_T is located on the edge of chromosome 6. In this region, there are no protein QTLs mapped (USDA, 2016). None of the assessed varieties had alleles associated with high or low protein contents on all of the seven markers. The ten varieties with the highest protein content had 4-6 loci with alleles associated with high protein content on the seven markers considered. Of these, the four varieties with the highest protein content had at least five markers with alleles associated with high protein content (Table 2). Among the ten varieties with the lowest protein content, the number of loci associated with high protein content varied from one to four.
Among the set of 20 varieties containing the ten higher and ten lower protein contents (Table 2), the correlation between protein content and the number of loci containing alleles associated with the trait was 0.84. Out of all 169 varieties, the correlation was 0.49, significant at 1% probability. Although there is a close association between protein content and the number of loci containing favorable alleles, this association is even more important in identifying extreme-high and extreme-low protein contents in soybean.
Of the seven markers, the best at discriminating genotypes into a group of high or low protein content were Gm11_5536036_C_A (chromosome 11) and Gm13_33637077_T_C (chromosome 13). In the group of varieties with the highest protein contents containing data for the marker, seven had alleles associated with high protein content on the SNP Gm11_5536036_C_A. In the group of nine varieties with the lowest protein content containing data for the marker, none had any allele associated with high protein content on this locus.
On the locus SNP Gm13_33637077_T_C, all ten of the varieties with the highest protein content had an allele associated with high protein content, whereas, in the group of eight varieties with the lowest protein content, only two had this allele.
Oil content had eight significant SNP markers in the stepwise multiple regression analysis, on the following six chromosomes: 1, 4, 5, 6, 17, and 19 (Table 1). These markers explained 78.30% of the variation in oil content, in the set of varieties studied. Two SNPs were identified on chromosome 5 and two on chromosome 17. Because the markers were in linkage equilibrium on the same chromosome, they were probably linked to different QTLs. The regression coefficients of the eight markers varied from 0.31 to 0.69. Csanádi et al. (2001) identified QTLs for oil content on chromosomes 14, 9, and 20. Panthee et al. (2005) reported QTLs on chromosomes 2, 10, and 18. The QTL of chromosome 2 was also detected by Zhang et al. (2004). Leamy et al. (2017), in turn, found QTLs on chromosomes 3 and 20 in wild soybean. Although none of these QTLs were observed in the present study, the eight QTLs identified were located in regions that have QTLs mapped for oil and fatty acid contents in the consensus map (USDA, 2016).
Notes to Table 1: (1) Angular coefficient of the regression equation. (2) The allele associated with increased protein content was codified as A, and the allele associated with reduced protein content was codified as B. R², correlation value; SNP, single-nucleotide polymorphism; and SD, standard deviation.
The Seed Oil 23-3 QTL is mapped in the region of chromosome 17, near the marker Gm17_8270421_A_G. The Seed Oil 6-1, Seed Protein 3-3, cqSeed Oil-001, and Seed oleic 6-5 QTLs are mapped near marker Gm04_40811025_C_A on chromosome 4. Marker Gm05_32361439_C_A is located in a region that contains both the Seed Oil 4-1 and Seed oil to protein ratio 1-1 QTLs, on chromosome 5. Marker Gm05_8656389_T_C is also on chromosome 5, near the Seed palmitic 2-1, Seed linolenic  QTLs. Marker Gm19_1420943_T_C is located on chromosome 19 near the Seed Oil 27-5, Seed linolenic 7-1, Seed linolenic 8-3, and Seed oil 37-6 QTLs. Marker Gm01_51330200_C_A is on chromosome 1, near the Seed Oil 24-21 QTL. Marker Gm17_18586619_C_T is located on chromosome 17 in the region that contains the Seed Oil [5][6]
Table 2. Single-nucleotide polymorphism (SNP) genotypes in the groups of ten soybean (Glycine max) varieties with the highest (allele A associated with increased protein contents) and the lowest protein contents (allele B associated with decreased protein contents).
(1) Number of loci containing alleles associated with high protein content, in the group of seven SNP markers being considered. (2) Data lost in genotyping.
In the group of ten varieties with the highest oil contents, the number of loci with alleles associated with the trait varied from 5 to 8, considering the eight significant SNP markers; however, in the group of ten varieties with the lowest oil content, the number varied from 2 to 4 (Table 3). In the group with these 20 varieties, the correlation between oil content and the number of loci was 0.88, and, when all 169 varieties were considered, the correlation was 0.60, significant at 1% probability.
Markers Gm01_51330200_C_A (chromosome 1) and Gm04_40811025_C_A (chromosome 4) best discriminated the groups with high and low oil contents. The SNP Gm01_51330200_C_A had alleles associated with high oil content in nine of the ten varieties with the highest oil contents, but in only one of the ten varieties with the lowest (Table 3). The SNP Gm04_40811025_C_A had alleles associated with high oil content in eight of the ten varieties with the highest oil contents, but in none of the ten varieties of the group with the lowest.
The SNPs Gm05_32361439_C_A and Gm05_8656389_T_C (chromosome 5) and Gm17_8270421_A_G (chromosome 17) were also well defined, with alleles associated with high oil content in nine of the ten varieties with the highest contents and in five of the nine with the lowest (Table 3).
In the studied population, the correlation between oil and protein contents was -0.68. Note that the significant markers in the multiple regression models for both characteristics were found on different chromosomes, except for chromosome 6, which had QTLs for protein and oil. Chromosomes 2, 11, 12, 13, and 16 have QTLs for protein content, while chromosomes 1, 4, 5, 17, and 19 have QTLs for oil content. The correlation between the two characteristics might occur due to population structure, genetic linkage, or pleiotropy. Because the QTLs for protein and oil contents were found on different chromosomes, there were QTLs that were not linked. While there are genetic and physiological limits to the simultaneous increase in both oil and protein contents in the grains, the presence of independent QTLs for these two characteristics indicates that it should be possible to simultaneously select for QTLs associated with oil and protein contents.
All SNP markers significant in the multiple regression analysis were located in regions with QTLs for oil and protein contents previously identified, except Gm06_49106015_C_T. Since these traits are complex, the identification of the same QTLs in independent studies increases the consistency of their mapping.
Mapping QTLs associated with protein and oil contents in soybean has generally been carried out on structured populations, which is not the case in the present study. However, the genomic regions associated with QTLs for protein and oil contents found here are consistent with those of a structured population. This is not surprising because the genetic bases of cultivated soybean are relatively narrow.
In marker-assisted selection for quantitative traits, plants with the greatest number of favorable alleles at QTL loci should be selected, since this greatly increases the chance of adding desirable characteristics to the plants. The high correlation between the number of loci containing favorable alleles and the studied phenotypes, observed in the present study, can help Brazilian soybean breeding programs to generate genotypes with superior contents of protein and oil in the grains. This is the first known study that identifies QTLs associated with protein and oil contents using Brazilian germplasm subjected to genome-wide association analysis, and it should be a starting point for understanding the population structure of this germplasm as to protein and oil contents.

Conclusions
1. The single-nucleotide polymorphism markers associated with oil and protein contents in soybean (Glycine max), reported here, can enhance the mapping consistency of the quantitative trait loci (QTLs) identified in other studies for these characteristics in the same genomic regions.
2. Selecting plants or lines carrying the greatest number of favorable alleles at the identified markers can increase their oil or protein content.
3. Most of the QTLs for protein content are found on different chromosomes than those for oil content, which allows marker-assisted selection to provide gains for both characteristics simultaneously.

Figure 1. Distribution frequency of protein (A) and oil (B) contents in the 169 Brazilian soybean (Glycine max) varieties evaluated. DM, dry matter.

Figure 2. Manhattan plot of the association probability between single-nucleotide polymorphism markers and the protein (A) and oil (B) contents of the 169 Brazilian soybean (Glycine max) varieties evaluated. The horizontal line indicates the cut-off point for significance, at 0.01% probability.

Table 1. Stepwise multiple regression analysis of the protein and oil contents of the 169 Brazilian soybean (Glycine max) varieties evaluated, averages of the genotypes containing alleles associated with high and low protein contents, and number of individuals found for each genotype.

Table 3. Single-nucleotide polymorphism (SNP) genotypes in the groups of ten soybean (Glycine max) varieties with the highest (allele A associated with increased oil contents) and the lowest oil contents (allele B associated with decreased oil contents).
(1) Number of loci containing alleles associated with high oil content, in the group of eight SNP markers being considered. (2) Data lost in genotyping.