Population structure correction for genomic selection through eigenvector covariates

We proposed a population structure correction for genome-wide selection based on covariance analysis via eigenvector (EVG) decomposition. The agreement between the predicted and true breeding values was evaluated by independent cross-validation data sets. Other correction methods such as correction via principal components, best linear unbiased prediction, and deregressed phenotype were also evaluated. Based on different simulation scenarios, the proposed EVG out performed the other methods in the prediction of accuracy.


INTRODUCTION
In genome-wide selection (GWS), the correction for population structure allows eliminating the effects of genitors or families from the total estimated breeding values in order to obtain the effects of pure Mendelian segregation.Thus, association analyses between allele markers and quantitative trait loci (QTLs) capture the genetic effects consequences of linkage disequilibrium (LD) by excluding relationship-based genealogy.
The performance of short-term genomic prediction (at present or subsequent generations) depends on genealogical kinship more than on LD (Resende et al. 2012).Therefore, predictions based on total genetic values, which consider both kinds of information, are recommended.On the other hand, long-term genomic prediction (prediction for future generations) only requires LD information between markers and QTLs, which is called Mendelian segregation effect.Thus, correction for population structure is required to obtain accurate predictions (Resende et al. 2012).Structure correction is also important for genome-wide association studies (GWAS) because they aim to detect causative variants and allow for association analysis between alleles and QTLs by considering only LD (free genealogy) and avoiding the occurrence of false positives (Daetwyler et al. 2012).
Therefore, statistical methods for population structure correction are required for efficient long-term GWS.Among them, correction via best linear unbiased prediction (BLUP) (Resende et al. 2012) and procedures based on deregressed phenotype (DP) (Garrick et al. 2009) are highlighted.Covariance analysis via eigenvectors (EVG), which is based on the genomic relationship matrix and was presented by Patterson et al. (2006) andYang et al. (2011) to correct for population structure in GWAS, has never been tested for genomic prediction.The inclusion of principal components (PC) (Daetwyler et al. 2012, Riedelsheimer et al. 2012, Price et al. 2006) obtained from the markers used in the model has also been proposed for population structure correction in both GWAS and GWS.The abovementioned correction methods consist of including eigenvectors from the genomic relationship matrix G in the model as fixed covariates.The eigenvectors are associated with the highest eigenvalues and first PCs of G. Thus, since G is linked to individuals and not to markers, the objective is to capture genetic variations depending on population structure.However, from the viewpoint of genomic selection, no reports on the comparison among these four population structure correction methods have been published yet.
Based on previous evaluations of methodologies for GWS, GWAS, and QTL mapping conducted with simulation studies (Azevedo et al. 2015, Piccoli et al. 2014, Ribeiro et al. 2005), we assessed the performance of the EVG method or population structure correction in terms of efficiency in estimating the free genomic values of relationships using simulated data under different scenarios (i.e., 2 heritability levels × 2 genetic architectures).We compared the proposed EVG method with other methods (i.e., BLUP, DP, and PC) in terms of prediction accuracy obtained from each scenario.

Simulated datasets
The datasets generated according to Azevedo et al. (2015) were simulated using the Real Breeding software (Viana 2013).Two random mating populations in linkage equilibrium were crossed to generate a population of 5,000 elements from 100 families using LD, which was subjected to five generations of random mating without mutation, selection, or migration.The resultant population is an advanced generation composite, which presents Hardy-Weinberg equilibrium and LD.According to Viana (2004), the LD value (�) in a composite population is , where a and b are two single nucleotide polymorphisms (SNPs), two QTLs, or one SNP and one QTL, θ is the frequency of recombinant gametes, and p 1 and p 2 are the allele frequencies in the parental populations 1 and 2, respectively.Moreover, the LD value depends on the allele frequencies in the parental populations.Thus, regardless of the distance between the SNPs and/or QTLs, if the allele frequencies are equal in the parental population, Δ = 0.The LD is maximized (|Δ| = 0.25) when θ = 0 and |p 1 − p 2 | = 1.In this case, the LD value is positive with coupling and negative with repulsion (Azevedo et al. 2015, Viana 2004).
From the advanced generation of the composite, 1,000 individuals with diploid genomes having a length of 200 centimorgans (cM) (L=2 Morgans) were generated assuming 10equally sized chromosomes, each one with two haplotypes.Azevedo et al. (2015) simulated a marker density by assigning 2,000 equidistant SNP markers that were separated by 0.1 cM across the 10chromosomes.Interestingly, 100 of the 2,000 markers were actually genes (QTL).The 1,000 individuals that came from the same generation and from 20 full-sib families (each one with 50 individuals) were genotyped and phenotyped.This simulation provided a typical, small effective population size (Ne= 39.22) and large LD in the breeding populations.Ne of approximately 40 and 50 individuals per family are typical values in elite breeding populations of allogamous plant species (Resende 2002).
The QTLs were distributed among the regions covered by the SNPs.For each trait, we informed the degree of dominance (d/a, where a and d are the genotypic values for one homozygote and heterozygote, respectively) and the direction of dominance (positive and/or negative).The obtained genotypic values for homozygotes were within the limits of Gmax=100(m + a) and Gmin=100(m -a) where m is the mean of genotypic values, which are the maximum and minimum values, respectively.Goddard et al. (2011) presented the realized proportion (r 2 mq ) of genetic variation explained by the markers as , where n QTL is the number of QTL and n is the number of markers.With n = 2,000 markers and n QTL = 100, we obtained r 2 mq = 0.95, which revealed that the genome was sufficiently saturated by markers.Traits with two genetic architectures were simulated, one following the infinitesimal model and the other containing five major effect genes accounting for 50% of the genetic variability.For the former, one additive effect of small magnitude on the phenotype was assigned (under the normal distribution setting) to each of the 100 QTLs.For the CF Azevedo et al. latter, small additive effects were assigned to the remaining 95 loci.The effects were normally distributed with zero mean and genetic variance (size of genetic effects) allowing the desired heritability level.The phenotypic value was obtained by adding to the genotypic value a random deviate from a normal distribution N (0, σ 2 e ), where the variance σ 2 e was defined according to two levels of narrow-sense heritabilities of approximately 0.20 and 0.35.Heritability levels were chosen to represent one trait with low heritability and another with moderate heritability, which addressed the cases where genomic selection is expected to prevail over phenotypic selection (Azevedo et al. 2015).The magnitudes of the narrow-sense and broad-sense heritability are associated with an average degree of dominance level (d/a) of approximately 1 (complete dominance) in a population with intermediate allele frequencies.Simulations assumed independence of additive and dominance effects, with dominance effects having the same distribution as the additive effects (both were normally distributed with zero mean).In the simulation, it was also observed that marker alleles had minor allele frequency (MAF) greater than 5%.

Scenarios
Four different scenarios were used in the analyses: two broad-sense heritability levels (around 0.20 and 0.35) × two genetic architectures (polygenic and mixed inheritance).The scenarios were defined as follows: Scenario 1, trait controlled by genes with small effects and heritability of 0.22; Scenario 2, trait controlled by genes with small effects and heritability of 0.37; Scenario 3, trait controlled by small and major effects genes and heritability of 0.20; Scenario 4, trait controlled by small and major effects of genes and heritability of 0.32.Each kind of population (or scenario) was simulated 10 times (Calus et al. 2008).

Eigenvectors correction
The EVG correction proposed in this study considers the eigenvectors of the genomic relationship matrix G as fixed covariates in the genomic mixed model.The eigenvectors are associated with the highest eigenvalues and first PCs of G. Thus, since G is linked to the individuals, and not to the markers, the purpose was to capture the genetic variance due to population structure.
The traditional phenotypic BLUP has been habitually applied in plant breeding to predict the genetic value (Resende and Barbosa 2006, Carvalho et al. 2008, Oliveira et al. 2011, Silva et al. 2015).If the relationship matrix A is computed via markers information (G matrix) and used within the traditional phenotypic BLUP, the method is called genomic best linear unbiased predictor (GBLUP).Thus, the GBLUP method was used and the eigenvectors from the G matrix were included in the model as follows: where y is the vector of phenotypes, v is the number of eigenvectors U i (here, 1-50) associated to the PCs with the highest eigenvalues, b is a vector of the other fixed effects with an incidence matrix X; g is a vector of additive random genetic effects of individuals, g � N(0,Gσ 2 g ) where G is the genomic additive relationship matrix G = MMʹ tr (N is the number of individuals, tr the trace operator matrix) and σ 2 g is the additive genetic variance; and e is the residuals vector, e � N(0,Iσ 2 e ) where I is the identity matrix and σ 2 e the residual variance.The eigenvectors were fitted as fixed effects, in which a i are the regression coefficients associated with them.The marker effects can be obtained by m ˆ = (Mʹ M) -1 Mʹ gˆ, M being the marker incidence matrix with values of 0, 1, and 2.

Deregressed phenotype
The traditional method used for population structure correction in genomic selection was proposed by Garrick et al. (2009) and it is called DP or adjusted phenotype.It assumes that the genealogical information, variance components, and genetic values are known and it requires a mixed model analysis based on the relationship matrix via pedigree [i.e., individual model (IM)].This correction is presented as follows: Where gˆi is the genetic value of the genotyped individual (i=1,...,N), λ*= is the average genetic value of the h and k genitors, Zʹ gm Z gm is the information content associated with the genitors average, Zʹ i Z i is the information content associated with the individuals including their progeny, and y gm and y i are the phenotypic observations corrected for fixed and individual effects, respectively.
The information matrix associated with the genitor average is unknown, and can be obtained by means of the following expression: being the reliability of the genetic value predicted for genitors h and k given by and r 2 i is the reliability associated with the predicted genetic value of the individual.Therefore, the information matrix associated with the individual Zʹ i Z i can be obtained through Zʹ gm Z gm .Thus, Zʹ i Z i is given by δZʹ gm Z gm + 2λ*(2δ -1).
The phenotype vectors corrected for fixed and genitor effects are the output of the equations yˆi = (-2λ*)yˆg m + (Zʹ i Z i + 2λ*)gˆi.Thus, the phenotype will also be corrected for the average genetic value of its genitors.After this correction, the deregressed genetic value (gˆi) is obtained by gˆi = . Analogous to EVG, the marker effects can be obtained using the GBLUP method as shown earlier in the text (i.e., m ˆ = (MʹM) -1 Mʹgˆ*).

Correction via BLUP
Population structure correction can be done by means of the additive genetic model proposed by Resende et al. (2012).The following model is considered: where y is the vector of phenotypic records, b is the vector of the fixed effects with an incidence matrix X, g is the vector of additive random genetic effects of individuals, and e is the residuals vector.The correction via the computational model BLUP assumes that the individual genealogical information is known, thus, after the genetic values prediction for each individual and their respective genitors, the corrected genetic value for the genitors can be obtained by: gˆc = gˆ -0.5gˆh -0.5gˆkwhere g ˆh and g ˆk are the predicted genetic g ˆc values of the h and k genitors, respectively.The genetic value deregressed and corrected for the genitor effects, i.e., the effect of the Mendelian segregation, is given by: is the heritability of the Mendelian segregation and h 2 the heritability of the trait.This step is necessary because the genetic values do not undergo two regressions, i.e., one based on the relationship matrix via pedigree (mixed model analysis) and another one on the marker matrix.In this context, gˆ* is equivalent to the residual (y -Xb ˆ -Z h,k gˆh ,k ) deriving from the reduced individual model, in which gˆh ,k is the predicted genetic value of the genitors.
After the above mentioned population structure correction, the marker effects can be estimated by means of the GBLUP model: where y c = g ˆ*,g** is the vector of additive random genetic effects of individuals with incidence matrix Z, which is here pre-corrected for population structure, and g** � N(0,Gσ 2 g ) with G being the genomic additive relationship matrix and σ 2 g the additive genetic variance.Analogous toother methods, the marker effects can be obtained using m ˆ = (MʹM) -1 Mʹgˆ**.

Principal component correction
and 0.70 for Scenarios 1, 2, 3, and 4, respectively) yielded comparable values for all scenarios and outperformed the other methods.Parametric accuracy values were obtained considering effective population size, chromosome size (in Morgans), number of loci, and heritability of each trait according to Grattapaglia and Resende (2011).Interestingly, the accuracy of the interval estimates obtained by correction via EVG and PC included the parametric value.On the other hand, the accuracy interval estimates obtained by correction via DP did not include the parametric value, whereas the interval estimates obtained by correction via BLUP contained the parametric value for all scenarios, with the exception of Scenario 3. The ranges of the confidence intervals for predictive ability values obtained by correction via EVG and PC were also similar, except for Scenario 4, for which the correction interval via PC had considerably high amplitude compared to that obtained via EVG.

Bias
The bias values estimated by correction via EVG and PC were bigger than 1 (Table 1), whereas by correction via DP they were smaller than 1 (Table 1), indicating that the predicted genetic values were underestimated and overestimated, respectively (Resende et al. 2012).Despite the accuracy obtained by correction via BLUP was higher than obtained via DP, the bias obtained by correction via BLUP was bigger than10, which indicates considerable underestimation of the predicted values.

Molecular heritability
Estimates of heritability in all scenarios were compared with parametric heritability values (0.12 for Scenarios 1 and  (Garrick et al. 2009).The scenarios were defined as follows: Scenario 1, trait controlled by genes with small effects and heritability 0.22; Scenario 2, trait controlled by genes with small effects and heritability 0.37; Scenario 3, trait controlled by small and major effect genes and heritability 0.20; Scenario 4, trait controlled by small and major effect genes and heritability 0.32.The scenarios are defined as follows: Scenario 1, trait controlled by genes with small effects and heritability 0.22; Scenario 2, trait controlled by genes with small effects and heritability 0.37; Scenario 3, trait controlled by small and major effect genes and heritability 0.20; Scenario 4, trait controlled by small and major effect genes and heritability 0.32.
LD information between markers and QTLs (Resende et al. 2012).According to Lehermeier et al. (2015), population structure is an important effect that should be considered in analysis, especially in plant breeding programs, for which the substructure is occasioned between breeding groups or is a consequence of artificial selection and genetic drift.
In this context, population structure effects can be seen as extra-effects affecting "pure" LD information.Thus, corrections for population structure by removing potential population structure effects are required to obtain reliable predictions in future generations (Resende et al. 2012, Albrecht et al. 2014, Guo et al. 2014).
In conclusion, among the methods analyzed in this study, the correction via EVG and PC provided the most accurate genomic breeding estimates values.The correction via BLUP considerably overestimated the genomic breeding estimates values, correction via EVG and PC moderately overestimated, and correction via DP moderately underestimated.The correction via BLUP and DP presented heritability estimates similar to the parametric heritability values.The correction via PC and EVG slightly overestimated the estimates in Scenarios 1 and 3, respectively.The individuals ranking obtained by correction via EVG proved to be considerably more efficient than that of other methods.

Figure 1 .
Figure1.Accuracy involving of predicted and parametric genetic values and the parametric values of the effects of Mendelian segregation for each scenario different scenarios obtained via the covariance analysis via eigenvector correction method (A) and via the principal component correction method (B).The scenarios are defined as follows: Scenario 1, trait controlled by genes with small effects and heritability 0.22; Scenario 2, trait controlled by genes with small effects and heritability 0.37; Scenario 3, trait controlled by small and major effect genes and heritability 0.20; Scenario 4, trait controlled by small and major effect genes and heritability 0.32.

Table 1 .
Molecular heritability (h 2 ), accuracy (r g.gˆ) , and bias estimated for each scenario using various population structure correction methods 1 Correction for population structure via reduced best linear unbiased prediction (BLUP) computational model; 2 Correction via inclusion of eigenvectors; 3 Correction via inclusion of principal components; 4 Correction via deregressed phenotype