Genomic prediction with the additive-dominant model by dimensionality reduction methods

The objective of this work was to evaluate the application of different dimensionality reduction methods in the additive-dominant model and to compare them with the genomic best linear unbiased prediction (G-BLUP) method. The dimensionality reduction methods evaluated were: principal components regression (PCR), partial least squares (PLS), and independent components regression (ICR). A simulated data set composed of 1,000 individuals and 2,000 single-nucleotide polymorphisms was used, being analyzed in four scenarios: two heritability levels × two genetic architectures. To help choose the number of components, the results were evaluated as to additive, dominant, and total genomic information. In general, PCR showed higher accuracy values than the other methods. However, none of the methodologies are able to recover true genomic heritabilities and all of them present biased estimates, underor overestimating the genomic genetic values. For the simultaneous estimation of the additive and dominance marker effects, the best alternative is to choose the number of components that leads the dominance genomic value to a higher accuracy.


Introduction
The genome-wide selection (GWS), conceived by Meuwissen (2001), assumes the existence of a linkage disequilibrium between markers and quantitative trait loci (QTLs), which makes it possible to estimate the genomic values of individuals from the estimation of marker effects on their phenotype, capturing the genotypic information that may influence their phenotypic variability (Goddard & Hayes, 2007). However, the implementation of GWS to assess individual genomic estimated breeding values (GEBVs) faces some statistical challenges, such as multicollinearity and highly correlated markers, which decreases the probability of a single nucleotide occurring independently of another in the same position of the genome (Gianola et al., 2003). Moreover, according to the same authors, due to the high cost of individual genotyping techniques, the number of individual observations is generally much lower than the number of markers.
Several methods -including Bayesian methods as the Bayesian least absolute shrinkage and selection operator (BLASSO) method, the mixed-model method, the genomic best linear unbiased predictor (G-BLUP) method, and dimensionality reduction methods such as principal components regression (PCR), partial least squares (PLS), and independent components regression (ICR) -have been applied to GWS and are recommended for genomic prediction Azevedo et al., 2014Azevedo et al., , 2015b. These methodologies guarantee the absence of multicollinearity between their components and consider the marker effects as fixed, being a solution for the statistical problems related to the high dimensionality of GWS (Resende et al., 2012).
The dimensionality reduction methods also stand out for presenting a great applicability and relatively simple theory when compared with the other methods applied to GWS (Long et al., 2011;Azevedo et al., 2013). However, those methodologies have only considered additive genomic effects (Long et al., 2011;Azevedo et al., 2013Azevedo et al., , 2014Azevedo et al., , 2015aDu et al., 2018).
The objective of this work was to evaluate the application of different dimensionality reduction methods in the additive-dominant model and to compare them with the G-BLUP method.

Materials and Methods
The used data set was simulated by Azevedo et al. (2015b) and described by Costa (2018). A population of 5,000 individuals from 100 families, generated from the random mating of two linkage equilibrium populations, was subjected to five generations of random mating without selection, mutation, or migration. The final population is an advanced generation composite in Hardy-Weinberg equilibrium and linkage disequilibrium. For marker density, Azevedo et al. (2015b) simulated a total of 2,000 equidistant single-nucleotide polymorphisms (SNPs) as biallelic markers, separated by 0.10 cM, on ten chromosomes. Marker alleles had a minor allele frequency greater than 5%. Of the 2,000 markers, 100 were randomly chosen to be QTLs. The linkage disequilibrium between the markers and QTLs was calculated according to Goddard et al. (2011) and was equal to 0.95. A total of 1,000 individuals from 20 full-sib families, with 50 individuals each one, were phenotyped for four traits and then genotyped. This simulation was made to mimic an elite breeding population of plant species with an effective size of approximately 40.
Traits were simulated with two genetic architectures: one following an infinitesimal model (polygenic inheritance) and the other with five major effect genes, responsible for 50% of the genetic variability (mixed inheritance). In the first genetic architecture, lowmagnitude effects on phenotype were assumed for each of the 100 QTLs, whereas, in the second, large effects were assigned to 5 QTLs representing 50% of the genetic variability and small effects were assigned to the remaining 95 QTLs.
The additive and dominant effects (SNPs and QTLs) were normally distributed with zero mean and genetic variance according to the desired heritability levels. The additive variances were 35, 35, 49, and 49, while the dominance variances were approximately 18, 23, 25, and 33 (Table 1). The simulations assumed independence between additive and dominance effects. For each trait, the average degree of dominance level was approximately 1 (complete dominance) in a population with intermediate allele frequencies. The obtained genotypic values were within the limits of Gmax = 100(m + a) and Gmin = 100(m -a), which are the maximum and minimum values, respectively, where m is the mean of the genotypic values and a is the homozygote genotypic value.
In order to obtain the phenotypic value, an environmental effect was added to the genotypic value. This effect was obtained from the normal distribution N(0, σ 2 e ), where the σ 2 e variance was defined according to two levels of narrow-sense heritability (additive heritabilities of 0.20 and 0.30, respectively) and two levels of broad-sense heritability (additive heritability plus dominance heritability of 0.30 and 0.50, respectively) ( Table 1). Heritability levels were chosen to represent traits with low and moderate heritabilities, in which case GWS is expected to be superior to phenotypic selection (Azevedo et al., 2015b). Therefore, for the populations of full-sib families, four scenarios were studied: two broad-sense heritability levels of approximately 0.30 and 0.50 × two genetic architectures (Table 1). Each scenario was simulated ten times.
The general model for genomic selection was given by: y = 1μ + Wm a + Sm d + e, where y is the vector of phenotypic observations with dimension I × 1, with I being the number of individuals; μ is the general mean of the trait and 1 is a vector of the same dimension of y where all elements are equal to one; W is the incidence matrix of additive effects with dimension I × J, with J being the number of markers; m a are the additive effects of the markers; S is the incidence matrix of the dominance effect with dimension I × J; m d are the dominance marker effects; and e is the vector of random errors with a variance structure given by e~N(0,I e σ 2 e ), with I e being the identity matrix and σ 2 e , the residual variance. The W and S matrices were coded according to Vitezica et al. (2013) and their juxtaposition was defined by the X matrix as X = [W|S] (dimension I × 2J) and the marker effects, as m = [m a |m d ]' (2J × 1). For the estimation of these effects, the dimensionality reduction methods and G-BLUP were used as detailed below.
The PCR method defined the principal components (PC) as Z = XP, where Z is the matrix with the first n PCR PC, which are orthogonal, and P is the matrix with the first n PCR eigenvectors of the variance matrix of (Ferreira, 2018).
The PLS method decomposes the X matrix and the y vector simultaneously. For this, the Y and Xj's variables are centered on the mean, defining variables U 1 and V 1j , where: u y y 1 and V x x j j j 1( ) , for j = 1,...,2J, respectively. Then, the S 1 variable is written as s 1 = V 1 'u 1 (dimension 2J × 1), where V 1 = [v 11 v 12 ... v 12J ], and applies to the singular value decomposition in the s 1 vector as follows: s 1 = L 1 kq' 1 , where L 1 is a unit matrix (2J × 2J) with the first column vector equal s 1 1 s to (normalized s 1 vector), k 1 is a vector (2J × 1) with the first value equal to s 1 (norm of vector s 1 ), and q 1 is a scalar equal to 1. The first component, T 1 , is defined by t 1 = V 1 L 1 (Garthwaite, 1994). The information about variables X j and Y that are not covered by component T 1 can be estimated by the residuals of the regression between the X j and T 1 variables or, equivalently, by the regression between the V 1j and T 1 latent (unobservable) variables (Garthwaite, 1994). Therefore, to define the second component, T 2 , the V 2(j) and U 2 variables were determined, respectively, by:  u 2 are the residuals, and r 1 and p 1 are the coefficients obtained from the regression between u 1 and t 1 and v 1(j) and t 1 , in that order. The S 2 variable is defined as: s 2 = V 2 'u 2 , and the procedure applied in s 1 is repeated to construct the T 2 component. The t 3 ,...,t nPLS (1≤n PLS ≤min(I,2J) -1) components were determined successively and analogously to the above, all being considered orthogonal (Garthwaite, 1994). For ICR, proposed under the context of GWS in additive models by Azevedo et al. (2013), but also valid for additive-dominant models, the X data matrix, whose values are centered on the mean, is decomposed into X = SA', where S(I × min(2J,I) -1) is the matrix of independent components and A(2J × min(2J,I) -1) is the mixture matrix. The A matrix is a function of two matrices, K and R; the first is obtained by the whitening process, making the covariance matrix of X equal to the identity matrix, so that matrix A is orthogonal (Yao et al., 2012), and the second is obtained by an iterative algorithm based on the principle of maximum entropy (Hyvärinen, 1998). After the convergence of this algorithm, the R matrix that guarantees the independence of the columns of was obtained. Therefore, the independent components are defined as S = XKR.
Subsequently, a multiple linear regression was performed between the Y variable and the Z, T, and S components obtained by PCR, PLS, and ICR, respectively. The following predictions were assumed: obtained previously are not associated with the original variables (molecular markers), that is, they do not have a biological interpretation. The original estimates of the marker effects (fixed effects in the reduction methods) are given by: J through PCR, PLS, and ICR, respectively (Azevedo et al., 2013). It should be noted that the expressions for estimating the marker effects depend on the choice of the number of components.
In this work, to determine the number of components to be used for PCR, PLS, and ICR, the exhaustive criterion adopted by Azevedo et al. (2013Azevedo et al. ( , 2014Azevedo et al. ( , 2015a in the context of additive models was used. In additivedominant models, this criterion consists of choosing the number of components that leads to a higher accuracy in the prediction of additive (a), dominance (d), or total (g = a + d) genomic values. However, ICR, even in additive models, requires a high computational demand to perform the analyzes by this criterion (Costa, 2018). Therefore, an alternative criterion for ICR, called optimized criterion, was adopted in the present study. It was proposed by Costa (2018) in the context of additive models and consists in obtaining a number of independent components equal to the number of principal components.
G-BLUP was based on an individual-level model given by: y = 1μ + Za + Zd + e, where a is the vector of the additive genomic effects of the individuals (I × 1), with an incidence matrix Z (I × 1), for which the variance structure is given by being the additive variance and G a (I × 1), the additive genomic relationship matrix; d is the vector of the dominance genomic effects of the individuals, with an incidence matrix Z (I × I), for which the variance structure is given by d N G d d The dimensionality reduction methods and G-BLUP were compared through a cross-validation process in which the first nine replicates were assumed as estimation populations (different populations) used to estimate the marker effects (additive or dominance) and the tenth replicate was assumed as a population of validation. All efficiency measures (heritability, Pesq. agropec. bras., Brasília, v.55, e01713, 2020 DOI: 10.1590/S1678-3921.pab2020.v55.01713 accuracy, regression coefficients between phenotypes and simulated genetic values, and relative efficiency in relation to G-BLUP) were obtained for each replicate in each scenario, considering all three genetic information (additive, dominance, or total), and the general results were reported as average values. The expressions for additive heritability (narrow-sense heritability), dominance heritability (proportion of dominance variance to phenotypic variance), total heritability (broad-sense heritability), accuracy (additive and dominance), regression coefficient between phenotypes and simulated genetic values (additive and dominance), and relative efficiency are shown in Table 2 and were also used by Azevedo et al. (2015b).
All computational routines were implemented with the R software (R Core Team, 2019), using the packages: sommer, for G-BLUP; pls, for PLS and PCR; and caret, for ICR. The computer configuration was Intel (R) Core (TM) i7-6500, 2.50 GHz, 16 Gb RAM.

Results and Discussion
When the dimensionality reduction methods were evaluated only considering the number of components based on dominance genomic values, ICR, under both the optimized and exhaustive criteria, presented, on average, a lower accuracy, followed by PLS and lastly by PCR (Tables 3, 4, 5, and 6). Considering all scenarios, PLS was, on average, 1.25% more efficient than G-BLUP for additive values and 6.5% less efficient for dominance values. PCR was, on average, 4.25 and 14.75% more efficient than G-BLUP for additive and dominance values, respectively. Specifically, considering additive effects, PLS was more efficient than G-BLUP in scenarios 1 and 2 of polygenic inheritance, whereas PCR was more efficient than G-BLUP in scenarios 1, 2, and 4. When considering dominance effects, PLS was more efficient than G-BLUP in scenarios 1 and 3, both with a heritability of 0.30, and PCR was more efficient than G-BLUP in all scenarios. Furthermore, PCR presented better results than ICR when explanatory variables (molecular markers) only showed a linear dependence (Azevedo et al., 2014). Therefore, the obtained results suggest that there is a greater proportion of linear dependence in the linkage disequilibrium structure between loci (Smith, 2020).
In addition to accuracy, other measures that can be evaluated are heritability and prediction bias Daetwyler et al., 2013;Gianola, 2013). The PLS and PCR methods underestimated the additive and dominance heritabilities, while G-BLUP underestimated additive heritability and overestimated dominance heritability in most scenarios. Therefore, these methods were not able to capture the additive and dominance heritabilities that were simulated. Regarding bias, it was defined as one minus the regression coefficient between the estimated genetic value (additive and dominance) and the phenotype, that is, coefficients equal to 1 indicated nonbiased genomic values. For regression coefficients below one (< 1), it was understood that the  respectively, additive accuracies of the dimensionality reduction methods and of G-BLUP; and r ddRM  and r dd  G-BLUP , respectively, dominance accuracies of the dimensionality reduction methods and of G-BLUP.

Regression coefficients
Pesq. agropec. bras., Brasília, v.55, e01713, 2020 DOI: 10.1590/S1678-3921.pab2020.v55.01713 Table 4. Additive, dominance, and total simulated heritabilities ( h s 2 ), number of components (Nc), additive and dominance heritabilities ( h aM 2 and h dM 2 ), additive and dominance accuracies ( r aa  and r dd  ), regression coefficients (   b ya and   b yd ), and relative additive and dominance efficiencies (EF a and Ef d ) obtained for the dimensionality reduction methods and the genomic best linear unbiased prediction (G-BLUP) method in scenario 2 of polygenic inheritance (with traits controlled by small gene effects), considering the additive, dominance, and total genomic values as targets. (1) ICR, independent components regression; PCR, principal components regression; and PLS, partial least squares.
(2) The exhaustive criterion consists in obtaining a number of components equal to the number of components that leads ICR, PCR, and PLS to a higher accuracy. The optimized criterion consists in obtaining a number of independent components equal to the number of main components that leads PCR to a higher accuracy. Table 3. Additive, dominance, and total simulated heritabilities ( h s 2 ), number of components (Nc), additive and dominance heritabilities ( h aM 2 and h dM 2 ), additive and dominance accuracies ( r aa  and r dd  ), regression coefficients (   b ya and   b yd ), and relative additive and dominance efficiencies (EF a and EF d ) obtained for the dimensionality reduction methods and the genomic best linear unbiased prediction (G-BLUP) method in scenario 1 of polygenic inheritance (with traits controlled by small gene effects), considering the additive, dominance, and total genomic values as targets. (1) ICR, independent components regression; PCR, principal components regression; and PLS, partial least squares. (2) The exhaustive criterion consists in obtaining a number of components equal to the number of components that leads ICR, PCR, and PLS to a higher accuracy. The optimized criterion consists in obtaining a number of independent components equal to the number of main components that leads PCR to a higher accuracy.
Pesq. agropec. bras., Brasília, v.55, e01713, 2020 DOI: 10.1590/S1678-3921.pab2020.v55.01713 Table 6. Additive, dominance, and total simulated heritabilities ( h s 2 ), number of components (Nc), additive and dominance heritabilities ( h aM 2 and h dM 2 ), additive and dominance accuracies ( r aa  and r dd  ), regression coefficients (   b ya and   b yd ), and relative additive and dominance efficiencies (EF a and EF d ) obtained for the dimensionality reduction methods and the genomic best linear unbiased prediction (G-BLUP) method in scenario 4 of mixed inheritance (with traits controlled by major and small gene effects), considering the additive, dominance, and total genomic values as targets. (1) ICR, independent components regression; PCR, principal components regression; and PLS, partial least squares.
(2) The exhaustive criterion consists in obtaining a number of components equal to the number of components that leads ICR, PCR, and PLS to a higher accuracy. The optimized criterion consists in obtaining a number of independent components equal to the number of main components that leads PCR to a higher accuracy. Table 5. Additive, dominance, and total simulated heritabilities ( h s 2 ), number of components (Nc), additive and dominance heritabilities ( h aM 2 and h dM 2 ), additive and dominance accuracies ( r aa  and r dd  ), regression coefficients (   b ya and   b yd ), and relative additive and dominance efficiencies (EF a and EF d ) obtained for the dimensionality reduction methods and the genomic best linear unbiased prediction (G-BLUP) method in scenario 3 of mixed inheritance (with traits controlled by major and small gene effects), considering the additive, dominance, and total genomic values as targets. (1) ICR, independent components regression; PCR, principal components regression; and PLS, partial least squares.
(2) The exhaustive criterion consists in obtaining a number of components equal to the number of components that leads ICR, PCR, and PLS to a higher accuracy. The optimized criterion consists in obtaining a number of independent components equal to the number of main components that leads PCR to a higher accuracy.
Pesq. agropec. bras., Brasília, v.55, e01713, 2020 DOI: 10.1590/S1678-3921.pab2020.v55.01713 predicted values had been overestimated, whereas, for those above one (> 1), it was concluded that the predicted values had been underestimated (Azevedo et al., 2015b). The PLS, PCR, and G-BLUP methods overestimated the additive values and underestimated those due to dominance. In additive models, Azevedo et al. (2014) and Azevedo et al. (2015a) found that PCR and PLS also led to overestimated additive genomic values in the genomic prediction of carcass traits in pigs. Although the bias is essential to determine the genetic merit of individuals, it does not influence their ranking and, subsequently, the selection process (Resende et al., 2012). The main difference between the PCR and PLS methods is that PCR uses only the explanatory variables (molecular markers) for the construction of the components and PLS uses the explanatory variables and the response of the variables (phenotypes) (Garthwaite, 1994). Since the exhaustive criterion aims to choose the number of components associated with a higher accuracy, it is expected that PLS require fewer components than PCR (Du et al., 2018). However, even with a higher number of components, PCR provided a reduction of 86.50% in the original data in scenario 2, when considering the dominance genomic value as a target. It should be noted that crossvalidation was carried out to protect overfitting and overparameterization (James et al., 2013).
On average, higher additive and dominance accuracies were observed for ICR (with the exhaustive and optimized criteria), PCR, and PLS (Table 7), when considering the additive and total genomic values as targets. According to Huang & Mackay (2016), the additive variance explains a greater proportion of genetic variance, even under dominant gene action, as shown by the parameterization of the marker incidence matrix for additive and dominance effects used in the present study. This is because the additive variance is maximized initially and the dominance variance is the residue of the total genetic variation. In this way, Huang & Mackay (2016) showed that prioritizing nonadditive gene actions can capture the majority of genetic variation. Moreover, Falconer & Mackay (1996) reported that an accurate estimation of dominance effects can improve genetic gain in breeding programs. Therefore, if the interest is specifically one of the additive or dominance effects, the analysis must be based on the target genomic information; however, the obtained results suggest that, based on the dominance genomic values, it is possible to better estimate both information simultaneously. This shows that the effective selection of parents, crosses, and clones can occur based only on the efficient estimation of the additive and dominance effects (Azevedo et al., 2015b).
In additive models, ICR presented better results than the other dimensionality reduction methods (Azevedo et al., 2013(Azevedo et al., , 2014(Azevedo et al., , 2015a. Although, in the present study, the additive-dominant models had the worst performance, further researches with ICR are still necessary, especially in cases of nonlinear dependence between markers. However, ICR requires a high computational effort, which is not feasible in the genomic prediction of breeding programs, requiring a reduction in computational time without a relevant loss in the efficiency of the method. In the present work, for ICR, the computational time for each replicate in each scenario lasted 221 hours using the exhaustive criterion, but was drastically reduced to about 0.18 hour with the optimized criterion. It is worth mentioning that, in most analyzes, this reduction in time did not result in a relevant loss of accuracy, as also observed for additive models by Costa (2018). The G-BLUP method presented a shorter computational analysis time of about 0.09 hour, followed by PCR, with 0.10 hour, and PLS, with 0.12 hour.
The average results of the ICR method showed that, considering the dominance genomic value as a target, the optimized criterion led to an additive accuracy equal to or higher than that of the exhaustive criterion in scenario 3 of mixed inheritance and heritability of 0.30, with additive accuracies of 0.48 and 0.40, respectively. The dominance accuracies obtained by the optimized and exhaustive criteria were equal, except in the scenario with polygenic inheritance and heritability of 0.50; in this case, the additive accuracies were 0.26 and 0.31, respectively.
Regarding biases, in general, both the exhaustive and the optimized criteria under the additive-dominant model led to biased additive and dominance genomic values, that is, showing a regression coefficient far from 1 (Resende et al., 2012). In additive models, Azevedo et al. (2014Azevedo et al. ( , 2015a concluded that the ICR with the exhaustive criterion also resulted in biased additive genomic values. Methods or criteria that lead to the same accuracy provide the same classification of individuals, even if one results in nonbiased genomic Pesq. agropec. bras., Brasília, v.55, e01713, 2020 DOI: 10.1590/S1678-3921.pab2020.v55.01713 Table 7. Mean results of the ratio between the additive and dominance accuracies of the dimensionality reduction methods in relation to the target genomic value (GV) in the evaluated scenario, according to the used criteria. (1) Scenario 1, traits controlled by 375 small gene effects, with polygenic inheritance; scenario 2, traits controlled by small gene effects, with polygenic inheritance; scenario 3, traits controlled by 398 major and small gene effects, with mixed inheritance; and scenario 4, traits controlled by 413 major and small gene effects, with mixed inheritance.
(2) ICR, independent components regression; PCR, principal components regression; and PLS, partial least squares. (3) An exhaustive criterion consists in obtaining a number of components equal to the number of components that leads ICR, PCR, and PLS to a higher accuracy. The optimized criterion consists in obtaining a number of independent components equal to the number of main components that leads PCR to a higher accuracy.
Considering the dominance genomic value as a target, the additive heritabilities estimated using the optimized criterion were either similar in scenarios 2 and 4, with a heritability of 0.50, or greater in scenarios 1 and 3, with a heritability of 0.30, when compared with the exhaustive criterion. Using both criteria, the dominance heritability estimates were similar in all scenarios. However, ICR was not able to capture the additive and dominance heritabilities that were simulated. Since the reduction methods assume that the markers in the model are fixed effects, an alternative for calculating heritability is precisely by the estimates of marker effects. However, if the genomic values are biased, it is inferred that the effects of the markers will also be biased. It is likely that this bias exists in the additive-dominance models because the estimates of dominance variance are less accurate and require much more information to be estimated (Toro & Varona, 2010). These results suggest that the ICR method was inefficient for estimating additive and dominance heritabilities in the evaluated scenarios. Regarding the two genetic architectures and the two levels of heritability simulated, it was not possible to find any pattern in the obtained results.

Conclusions
1. Under the additive-dominant model, the relative efficiency of the principal components regression is higher in terms of accuracy, compared with the genomic best linear unbiased prediction (G-BLUP) and the other dimensionality reduction methods evaluated.
2. None of the assessed methods (G-BLUP, principal components regression, independent components regression, and partial least squares) capture the simulated heritabilities and all of them show biased additive and dominance genomic values.