INTRODUCTION
In genomewide selection (GWS), the correction for population structure allows eliminating the effects of genitors or families from the total estimated breeding values in order to obtain the effects of pure Mendelian segregation. Thus, association analyses between allele markers and quantitative trait loci (QTLs) capture the genetic effects consequences of linkage disequilibrium (LD) by excluding relationshipbased genealogy.
The performance of shortterm genomic prediction (at present or subsequent generations) depends on genealogical kinship more than on LD (^{Resende et al. 2012}). Therefore, predictions based on total genetic values, which consider both kinds of information, are recommended. On the other hand, longterm genomic prediction (prediction for future generations) only requires LD information between markers and QTLs, which is called Mendelian segregation effect. Thus, correction for population structure is required to obtain accurate predictions (^{Resende et al. 2012}). Structure correction is also important for genomewide association studies (GWAS) because they aim to detect causative variants and allow for association analysis between alleles and QTLs by considering only LD (free genealogy) and avoiding the occurrence of false positives (^{Daetwyler et al. 2012}).
Therefore, statistical methods for population structure correction are required for efficient longterm GWS. Among them, correction via best linear unbiased prediction (BLUP) (^{Resende et al. 2012}) and procedures based on deregressed phenotype (DP) (^{Garrick et al. 2009}) are highlighted. Covariance analysis via eigenvectors (EVG), which is based on the genomic relationship matrix and was presented by ^{Patterson et al. (2006}) and ^{Yang et al. (2011}) to correct for population structure in GWAS, has never been tested for genomic prediction. The inclusion of principal components (PC) (^{Daetwyler et al. 2012}, ^{Riedelsheimer et al. 2012}, ^{Price et al. 2006}) obtained from the markers used in the model has also been proposed for population structure correction in both GWAS and GWS. The abovementioned correction methods consist of including eigenvectors from the genomic relationship matrix G in the model as fixed covariates. The eigenvectors are associated with the highest eigenvalues and first PCs of G. Thus, since G is linked to individuals and not to markers, the objective is to capture genetic variations depending on population structure. However, from the viewpoint of genomic selection, no reports on the comparison among these four population structure correction methods have been published yet.
Based on previous evaluations of methodologies for GWS, GWAS, and QTL mapping conducted with simulation studies (^{Azevedo et al. 2015}, ^{Piccoli et al. 2014}, ^{Ribeiro et al. 2005}), we assessed the performance of the EVG method or population structure correction in terms of efficiency in estimating the free genomic values of relationships using simulated data under different scenarios (i.e., 2heritability levels × 2genetic architectures). We compared the proposed EVG method with other methods (i.e., BLUP, DP, and PC) in terms of prediction accuracy obtained from each scenario.
MATERIAL AND METHODS
Simulated datasets
The datasets generated according to ^{Azevedo et al. (2015}) were simulated using the Real Breeding software (^{Viana 2013}). Two random mating populations in linkage equilibrium were crossed to generate a population of 5,000elements from 100 families using LD, which was subjected to five generations of random mating without mutation, selection, or migration. The resultant population is an advanced generation composite, which presents HardyWeinberg equilibrium and LD. According to ^{Viana (2004}), the LD value (Δ) in a composite population is
From the advanced generation of the composite, 1,000 individuals with diploid genomes having a length of 200 centimorgans (cM) (L=2 Morgans) were generated assuming 10equally sized chromosomes, each one with two haplotypes. ^{Azevedo et al. (2015}) simulated a marker density by assigning 2,000 equidistant SNP markers that were separated by 0.1 cM across the 10chromosomes. Interestingly, 100 of the 2,000 markers were actually genes (QTL). The 1,000 individuals that came from the same generation and from 20 fullsib families (each one with 50 individuals) were genotyped and phenotyped. This simulation provided a typical, small effective population size (Ne= 39.22) and large LD in the breeding populations. Ne of approximately 40 and 50 individuals per family are typical values in elite breeding populations of allogamous plant species (^{Resende 2002}).
The QTLs were distributed among the regions covered by the SNPs. For each trait, we informed the degree of dominance (d/a, where a and d are the genotypic values for one homozygote and heterozygote, respectively) and the direction of dominance (positive and/or negative). The obtained genotypic values for homozygotes were within the limits of Gmax=100(m + a) and Gmin=100(m  a) where m is the mean of genotypic values, which are the maximum and minimum values, respectively. ^{Goddard et al. (2011}) presented the realized proportion (
Traits with two genetic architectures were simulated, one following the infinitesimal model and the other containing five major effect genes accounting for 50% of the genetic variability. For the former, one additive effect of small magnitude on the phenotype was assigned (under the normal distribution setting) to each of the 100 QTLs. For the latter, small additive effects were assigned to the remaining 95 loci. The effects were normally distributed with zero mean and genetic variance (size of genetic effects) allowing the desired heritability level. The phenotypic value was obtained by adding to the genotypic value a random deviate from a normal distribution N (
Scenarios
To evaluate the proposed accuracy estimators, we studied 4 different scenarios: 2 heritability levels (around 0.20 and 0.35) × 2 genetic architectures. These 4 scenarios were analyzed considering four forms of validation and three accuracy estimators. The scenarios were defined as follows: Scenario 1, trait controlled by genes with small effects and heritability of 0.22; Scenario 2, trait controlled by genes with small effects and heritability of 0.37; Scenario 3, trait controlled by small and major effects genes and heritability of 0.20; Scenario 4, trait controlled by small and major effects of genes and heritability of 0.32. Each kind of population (or scenario) was simulated 10 times (^{Calus et al. 2008}).
Eigenvectors correction
The EVG correction proposed in this study considers the eigenvectors of the genomic relationship matrix G as fixed covariates in the genomic mixed model. The eigenvectors are associated with the highest eigenvalues and first PCs of G. Thus, since G is linked to the individuals, and not to the markers, the purpose was to capture the genetic variance due to population structure.
The traditional phenotypic BLUP has been habitually applied in plant breeding to predict the genetic value (^{Resende and Barbosa 2006}, ^{Carvalho et al. 2008}, ^{Oliveira et al. 2011}, ^{Silva et al. 2015}). If the relationship matrix A is computed via markers information (G matrix) and used within the traditional phenotypic BLUP, the method is called genomic best linear unbiased predictor (GBLUP). Thus, the GBLUP method was used and the eigenvectors from the G matrix were included in the model as follows:
where y is the vector of phenotypes, v is the number of eigenvectors U
_{
i
} (here, 150) associated to the PCs with the highest eigenvalues, b is a vector of the other fixed effects with an incidence matrix X; g is a vector of additive random genetic effects of individuals,
Deregressed phenotype
The traditional method used for population structure correction in genomic selection was proposed by ^{Garrick et al. (2009}) and it is called DP or adjusted phenotype. It assumes that the genealogical information, variance components, and genetic values are known and it requires a mixed model analysis based on the relationship matrix via pedigree [i.e., individual model (IM)]. This correction is presented as follows:
Where
The information matrix associated with the genitor average is unknown, and can be obtained by means of the following expression:
where
The phenotype vectors corrected for fixed and genitor effects are the output of the equations
Correction via BLUP
Population structure correction can be done by means of the additive genetic model proposed by ^{Resende et al. (2012}). The following model is considered:
where y is the vector of phenotypic records, b is the vector of the fixed effects with an incidence matrix X, g is the vector of additive random genetic effects of individuals, and e is the residuals vector. The correction via the computational model BLUP assumes that the individual genealogical information is known, thus, after the genetic values prediction for each individual and their respective genitors, the corrected genetic value for the genitors can be obtained by:
where
where
After the above mentioned population structure correction, the marker effects can be estimated by means of the GBLUP model:
where
Principal component correction
The correction via PC method is traditionally used in genomic association (^{Daetwyler et al. 2012}, ^{Riedelsheimer et al. 2012}, ^{Price et al. 2006}). Similar to EVG, PC correction is based on the inclusion of fixed covariates in the GBLUP model; however, here the covariates are the PCs of the G matrix as follow:
where q is the number (here 150) of the PCs given by
Q = UG
,
Qi
represents the columns of the matrix Q, b is the vector of fixed effects with an incidence matrix X, g is the vector of additive random genetic effects of individuals, and e is the residuals vector. These components were fitted as fixed effects because β
_{
i
} represents the regression coefficients related to the PCs. The marker effects can be obtained using
Methods comparison
Four scenarios were simulated ten times, with nine of these replications considered as training populations and the remaining replication as the validation population. The EVG, PC, and BLUP methods were compared to the DP method. BLUP and the PC correspond to the population structure correction methods most used in genomic prediction.
The validation population was used to evaluate the agreement between the predicted and parametric breeding values based on Mendelian segregation (i.e., phenotypes without family effects) through four measurements: i) accuracy, given by the correlation between the predicted and the parametric breeding values (simulated); ii) estimation bias, given by the regression coefficient obtained by regression between the predicted and the parametric breeding values with 1 being the ideal value (i.e., absence of bias); iii) molecular heritability given by
Computational tools
The mixed model analysis based on the relationship matrix via pedigree was conducted using ASREML® (^{Gilmour et al. 2009}) and the computational routines of the correction methods were implemented in the R software (^{R Development Core Team 2015}) using the pedigreemm (BLUP), rrBLUP (EVG and PC corrections), base (eigenvectors), and stats (principal components) packages and the pedigreemm, mixed.solve, eigen, and princomp functions, respectively.
RESULTS AND DISCUSSION
Accuracy
The average accuracy of the Mendelian segregation effect estimates is shown in Figure 1. The curves obtained with the EVG method (Figure 1A) show that the accuracy reaches a plateau when 20eigenvectors are included in the model, whereas a smooth linear decrease is observed when accuracy is calculated using the PC method (Figure 1B). Although, the correction via PC presented higher accuracy than EVG did, the number of eigenvectors is in accordance with that obtained by ^{Janss et al. (2012}), who demonstrated that the number of eigenvectors typically used in these analyses is 10 or 20 in differently simulated stratified populations.
According to ^{Daetwyler et al. (2012}), the curve representing accuracy against the number of eigenvectors (Figure 1A) reaches a plateau only when LD takes place. Therefore, the curve in Figure 1B, which was obtained using the PC correction method and shows a linear decrease, possibly captured genetic information beyond LD, such as cosegregation (^{Azevedo et al. 2015}).
The average values of the molecular heritability estimates, accuracy, and bias obtained by performing correction via EVG (i.e., 20 eigenvectors), PC (with the inclusion of 1 principal component), BLUP, and DP are shown in Table 1. For accuracy, the correction via EVG (0.62, 0.71, 0.62, and 0.70, for Scenarios 1, 2, 3, and 4, respectively) and PC (0.61, 0.75, 0.60, and 0.70 for Scenarios 1, 2, 3, and 4, respectively) yielded comparable values for all scenarios and outperformed the other methods. Parametric accuracy values were obtained considering effective population size, chromosome size (in Morgans), number of loci, and heritability of each trait according to ^{Grattapaglia and Resende (2011}). Interestingly, the accuracy of the interval estimates obtained by correction via EVG and PC included the parametric value. On the other hand, the accuracy interval estimates obtained by correction via DP did not include the parametric value, whereas the interval estimates obtained by correction via BLUP contained the parametric value for all scenarios, with the exception of Scenario 3. The ranges of the confidence intervals for predictive ability values obtained by correction via EVG and PC were also similar, except for Scenario 4, for which the correction interval via PC had considerably high amplitude compared to that obtained via EVG.
Scenarios  Parametric  BLUP^{ 1 }  EVG^{ 2 }  PC^{ 3 }  DP^{ 4 }  

h^{2}  Scenario1  0.12  0.14_{[0.02;0.26]}  0.16_{[0.07;0.24]}  0.14_{[0.09;0.20]}  0.13_{[0.06;0.20]} 
Scenario2  0.22  0.23_{[0.16;0.29]}  0.30_{[0.21;0.39]}  0.29_{[0.21;0.37]}  0.23_{[0.16;0.29]}  
Scenario3  0.12  0.11_{[0.08;0.15]}  0.14_{[0.02;0.26]}  0.13_{[0.07;0.19]}  0.11_{[0.08;0.15]}  
Scenario4  0.22  0.19_{[0.12;0.26]}  0.27_{[0.16;0.38]}  0.23_{[0.09;0.36]}  0.19_{[0.12;0.26]}  

Scenario1  0.61  0.38_{[0.06;0.69]}  0.62_{[0.54;0.69]}  0.61_{[0.55;0.68]}  0.31_{[0.26;0.37]} 
Scenario2  0.66  0.58_{[0.15;1.00]}  0.71_{[0.65;0.77]}  0.75_{[0.69;0.81]}  0.40_{[0.35;0.46]}  
Scenario3  0.59  0.29_{[0.22;0.36]}  0.62_{[0.52;0.72]}  0.60_{[0.52;0.68]}  0.31_{[0.24;0.38]}  
Scenario4  0.64  0.60_{[0.22;0.97]}  0.70_{[0.64;0.77]}  0.70_{[0.57;0.83]}  0.38_{[0.31;0.46]}  
Bias  Scenario1    >10  2.79_{[1.86;3.74]}  1.66_{[1.46;1.85]}  0.92_{[0.43;1.41]} 
Scenario2    >10  1.87_{[1.57;2.18]}  1.41_{[1.30;1.51]}  0.77_{[0.67;0.88]}  
Scenario3    >10  3.38_{[0.61;6.14]}  1.67_{[1.55;1.79]}  0.94_{[0.61;1.27]}  
Scenario4    >10  2.11_{[1.62;2.60]}  1.47_{[1.31;1.64]}  0.82_{[0.66;0.96]} 
^{1} Correction for population structure via reduced best linear unbiased prediction (BLUP) computational model; ^{2} Correction via inclusion of eigenvectors; ^{3} Correction via inclusion of principal components; ^{4} Correction via deregressed phenotype (Garrick et al. 2009). The scenarios were defined as follows: Scenario 1, trait controlled by genes with small effects and heritability 0.22; Scenario 2, trait controlled by genes with small effects and heritability 0.37; Scenario 3, trait controlled by small and major effect genes and heritability 0.20; Scenario 4, trait controlled by small and major effect genes and heritability 0.32.
Bias
The bias values estimated by correction via EVG and PC were bigger than 1 (Table 1), whereas by correction via DP they were smaller than 1 (Table 1), indicating that the predicted genetic values were underestimated and overestimated, respectively (^{Resende et al. 2012}). Despite the accuracy obtained by correction via BLUP was higher than obtained via DP, the bias obtained by correction via BLUP was bigger than10, which indicates considerable underestimation of the predicted values.
Molecular heritability
Estimates of heritability in all scenarios were compared with parametric heritability values (0.12 for Scenarios 1 and 3 and 0.22 for Scenarios 2 and 4). In Scenarios 1 and 4, the BLUP, PC, and DP estimates were similar to the parametric heritability, whereas the EVG method yielded slightly overestimated results (Table 1). In Scenario 2, the BLUP and DP estimates were similar to the parametric heritability, whereas the EVG and PC estimates were overestimated. In Scenario3, all estimates were similar to the parametric heritability. However, the confidence intervals associated with the four methods all contained the value of the parametric heritability.
Since PCs are estimated based on the G matrix, ^{Janss et al. (2012}) and ^{de los Campos and Sorensen (2014}) reported that inclusion of PCs as fixed effects in genomic models would carry doubled information from G, which could lead to errors in the variance components and heritability estimation. ^{Price et al. (2010}) discussed and advocated the use of the principal components as covariates of fixed effects because the effect of population structure, seen as a function of genetic ancestry, is the same for all the samples. In addition, according to Price et al. (2010), if PCs are adjusted as random effects, spurious associations can be detected, especially those of unusually differentiated markers. In addition, according to ^{Tucker et al. (2014}), the restricted maximum likelihood (REML) approach, which consists in projecting the G matrix into a subspace orthogonal to the PCs, effectively removes any PC effect on the variance components and heritability estimation. This agrees with the results reported in this study, in which the heritability values estimated by PC correction are similar to the parametric values in almost all the considered simulation scenarios.
Agreement of ranking
The results obtained by parametric ranking of individuals and the ranking obtained by each of the methods analyzed are shown in Table 2. Although the predicted genetic values found by correction via EVG are more biased than those found via PC, the ranking of this method is considerably more efficient, especially in Scenario 4. In comparison with the other methods, correction via EVG is preferable for individual ranking, especially when the number of individuals is increased. The results obtained by correction via BLUP and DR differ when the number of selected individuals is reduced, but when this number increases, the results are similar.
Scenarios  Top  BLUP^{ 1 } (%)  EVG^{ 2 } (%)  PC^{ 3 } (%)  DP^{ 4 } (%) 

Scenario1  5  7  16  9  2% 
10  09  28  27  7  
50  22  30  30  21  
100  24  38  34  25  
200  39  53  48  36  
250  43  56  52  40  
500  63  72  69  61  
Scenario 2  5  13  18  18  09 
10  18  32  23  11  
50  31  35  32  22  
100  38  45  40  30  
200  48  55  48  39  
250  53  59  53  44  
500  70  75  70  63  
Scenario 3  5  00  16  13  09 
10  04  19  16  08  
50  11  31  26  14  
100  19  39  30  19  
200  31  50  44  30  
250  36  53  49  36  
500  60  71  68  61  
Scenario 4  5  22  20  09  02 
10  21  24  18  11  
50  29  36  26  16  
100  36  44  34  20  
200  47  54  45  31  
250  53  58  51  37  
500  70  75  69  62 
^{1} Correction for population structure via reduced best linear unbiased prediction (BLUP) computational model; ^{2} Correction via inclusion of eigenvectors; ^{3} Correction via inclusion of principal components; ^{4} Correction via deregressed phenotype (Garrick et al. 2009). The scenarios were defined as follows: Scenario 1, trait controlled by genes with small effects and heritability 0.22; Scenario 2, trait controlled by genes with small effects and heritability 0.37; Scenario 3, trait controlled by small and major effect genes and heritability 0.20; Scenario 4, trait controlled by small and major effect genes and heritability 0.32.
The main advantage of population structure correction for prediction, independently of the method, is that the estimated marker effects are valid for prediction in various generations, including future generations. The reestimation of new effects of markers for each generation would involve the development of a new model and heavy time investment, such as for collection of genotypic and phenotypic data and genealogy of the new generation. However, one advantage of correction via EVG and PC over other correction methods is the nonassumption of prior information such as pedigree; in fact, this information is not always available and is often associated with clerical errors. Moreover, correction using eigenvectors or components is associated with a relatively simple theory compared to the DP method, and is performed in one single step, which requires far less computational time than the other methods.
The population structure directly affects association studies and genomic prediction (^{Guo et al. 2014}). Moreover, it is especially important for longterm genomic prediction considering future generations, which requires only LD information between markers and QTLs (^{Resende et al. 2012}). According to ^{Lehermeier et al. (2015}), population structure is an important effect that should be considered in analysis, especially in plant breeding programs, for which the substructure is occasioned between breeding groups or is a consequence of artificial selection and genetic drift.
In this context, population structure effects can be seen as extraeffects affecting “pure” LD information. Thus, corrections for population structure by removing potential population structure effects are required to obtain reliable predictions in future generations (^{Resende et al. 2012}, ^{Albrecht et al. 2014}, ^{Guo et al. 2014}).
In conclusion, among the methods analyzed in this study, the correction via EVG and PC provided the most accurate genomic breeding estimates values. The correction via BLUP considerably overestimated the genomic breeding estimates values, correction via EVG and PC moderately overestimated, and correction via DP mode rately underestimated. The correction via BLUP and DP presented heritability estimates similar to the parametric heritability values. The correction via PC and EVG slightly overestimated the estimates in Scenarios 1 and 3, respectively. The individuals ranking obtained by correction via EVG proved to be considerably more efficient than that of other methods.