The use of weighted multiple linear regression to estimate QTL-by-QTL epistatic effects

Knowledge of the nature and magnitude of gene effects, as well as their contribution to the control of metric traits, is important in formulating efficient breeding programs for the improvement of plant genetics. Information concerning a genetic parameter such as the additive-by-additive epistatic effect can be useful in traditional breeding. This report describes the results obtained by applying weighted multiple linear regression to estimate the parameter connected with an additive-by-additive epistatic interaction. Three weight variants were used: (1) standard weights based on estimated variances, (2) different weights for minimal, maximal and other lines, and (3) different weights for extreme and other lines. The approach described here combines two methods of estimation, one based on phenotypic observations and the other using molecular marker data. The comparison was done using Monte Carlo simulations. The results show that the application of weighted regression to the marker data yielded estimates similar to those obtained by phenotypic methods.


Introduction
Knowledge of the nature and magnitude of gene effects, as well as their contribution to the control of metric traits, is important in formulating efficient breeding programs for plant genetic improvement. The inheritance of quantitative traits has been described as a "moving target" since these traits are affected not only by the actions of multiple individual genes but also by the interaction between genes (Lukens and Doebley, 1999). In this context, information on additive-by-additive interaction epistatic effects can be useful in traditional breeding programs (Sharmila et al., 2007;da Silva Guimarães et al., 2010;Lau and Muniandy, 2012).
Epistasis is a phenomenon in which the effects of one gene are modified by one or several other genes. The gene whose phenotype is expressed is referred to as epistatic whereas the phenotype that is altered or suppressed is referred to as hypostatic. Epistasis can be contrasted with dominance, which involves an interaction between alleles at the same gene locus, or with an additive effect, which involves an interaction between alleles at the other gene locus.
The term epistasis was first described by Bateson (1907) to denote the suppression of gene expression at one locus by a gene at another locus. Fisher (1918) used the term epistacy to refer to a statistical interaction in the sense of a deviation from additive effects in the statistical linear model. Cordell (2002Cordell ( , 2009 has argued that the statistical tests that are often used to assess interactions (Hahn et al., 2003;Moore, 2004;Chung et al., 2007;Zhang and Liu, 2007;Gayán et al., 2008) are of limited use in elucidating the type of biological interaction that Bateson had originally conceived. Phillips (2008) discussed the ambiguity in the term epistasis and distinguished what he considered to be three forms of epistasis, i.e., 1) statistical epistasis, which describes a departure from additive effects in a statistical model, 2) compositional epistasis, which corresponds to epistasis in Bateson's original sense of the term, i.e., the masking of the effect of an allele at one locus by an allele at another locus, and 3) functional epistasis, which describes physical molecular interactions between various proteins (and other genetic elements). Jannink and Jansen (2001) described a contrast approach for building linear models and a likelihood method to test for the presence of epistasis.
method, based on marker observations. One of the conclusions of these studies was that the estimate of the total additive-by-additive interaction effect based on the marker data was in most cases smaller than that based on the phenotype.
The explanation of this phenomenon may be simple. Phenotypic data can be used to estimate only the total epistatic effect of all hypothetical gene pairs that determine the trait. By using marker data, which can be more or less precisely mapped in the genome, we can estimate individual effects of gene-by-gene interactions while keeping the number of QTL-by-QTL interactions intentionally low for practical reasons. The sum of the obtained effects is lower than the phenotypic estimate and this difference may be enhanced by the absence of markers in the regions where genes are located.
In addition to the above explanation, other possible sources of the differences between calculated estimates should be considered. The results mentioned above relate to QTL-by-QTL interaction effects obtained by the simplest possible methods, i.e., by multiple linear regression on the marker data. In the present paper, modification of this regression by using empirical weights is shown to provide better agreement between the phenotypic and genotypic estimates.
In the approach described here, estimation of the parameter connected with the additive-by-additive epistatic interaction effect was based on extreme groups of homozygous lines and on data for genotypic markers. Weighted multiple linear regression was used to estimate the QTL-by-QTL epistatic effects by employing: (1) standard weighted regression with weights based on estimated variances, (2) different weights for minimal, maximal and other lines, and (3) different weights for extreme lines and different weights for other lines. The effectiveness of the proposed method was investigated by using Monte Carlo simulations.

Methods
Estimation of the effect of additive-by-additive interactions on gene action As a starting point, consider a population of n significantly different homozygous (doubled haploid -DH or recombinant inbred -RI) plant lines from a cross between two homozygous parents, as is currently practiced in the breeding of selfing species. For this population, an n-vector of phenotypic mean observations y = [y 1 y 2 ... y n ]' and q n-vectors of marker genotype observations m l , where l = 1, 2, ..., q, are obtained. The i-th element (i = 1, 2, ..., n) of vector m l is equal to -1 or 1, depending on the parental genotype exhibited by the i-th line.

Estimation based on the phenotype
Estimation of the additive-by-additive interaction of homozygous loci aa (epistasis) (Kearsey and Pooni, 1996) based on phenotypic observations y requires the identification of groups of extreme lines, i.e., lines with minimal and maximal expression of the observed trait (Choo and Reinbergs, 1982). The group of minimal lines consists of lines that theoretically contain only alleles that reduce the value of the trait. Analogously, the group of maximal lines contains lines connected only with alleles that increase the trait value. In this report, the groups of extreme lines are identified by using the quantile method (Bocianowski et al., 1999) in which lines with mean values £ 0.03 or ³ 0.97 of the quantile corresponding to the empirical distribution of means were assumed to represent minimal and maximal lines, respectively. The total additive-by-additive interaction effect of aa may be estimated by the following formula (Surma et al., 1984) ( ) where L max and L min denote the means for the groups of maximal and minimal lines, respectively, and L denotes the mean for all lines.

Estimation based on the genotype
The estimation of aa was based on the assumption that genes responsible for the trait were marked perfectly as observed molecular markers. By choosing from all the observed markers p the variability of the trait and model observations for the lines can be defined as where 1 is the n-dimensional vector of ones, m is the general mean, X is the (n´p)-dimensional matrix of the form If G is of full rank, the estimate of a is given by (Searle, 1982): where W -1 is a diagonal matrix of unknown variances of observations that may be found empirically by estimation. The total additive-by-additive epistatic effect of genes that influence the trait is defined as: The markers for model (2) may be selected by a stepwise regression procedure (Charcosset et al., 2001). For this, a two-stage algorithm was used in which: (i) selection was done by using a forward-backward stepwise, multiple linear regression with a probability into and out of the model of 0.01 independently within all linkage groups, and (ii) markers selected in this way were placed in one group and subjected to the second backward selection (see Jansen and Stam, 1994). The final genetic model incorporated significant additive and epistatic effects. A QTL was declared if the phenotype was associated with a marker locus at p < 0.001. For the weights, three forms of matrix W were proposed, as follows: Proposal 1 -Ordinary weighted multiple linear regression, with weights based on the estimated variance. The matrix W is defined as: where w i is the estimated variance for the i-th line, i = 1,2, ..., n.
Proposal 2 -Weighted regression with different weights for minimal, maximal and other lines. The elements of matrix W are: where w min is the estimated average of variances for minimal lines, w max is the average of variances for maximal lines and w other is the average of variances for other lines. where In the case of unweighted regression all the elements of matrix W are equal to 1.

Simulation studies
In the Monte Carlo simulations that compared the phenotypic and genotypic (unweighted and three forms of weighted) estimates of the additive-by-additive interaction of QTL effects the following variants of the selected parameters were adopted. The true value of the parameter was assumed to be equal to five (aa = 5) and the total mean value of the trait to 100. The number of QTLs affecting the trait was five (each with an additive effect of two). A QTL was marked perfectly so that the marker and QTLs were in complete linkage disequilibrium with no recombination. At least two markers without additive effects were located between two QTLs. In all, 200 homozygous lines and 210 markers were analyzed. Markers were located in five, seven or 10 linkage groups (LG) that contained 42, 30 or 21 markers, respectively. Distances between markers were all equal (10 cM) or unequal (for five LG: 8, 9, 10, 11 and 12 cM; for seven LG: 8,9,9.5,10,10.5,11 and 12 cM;for 10 LG: 8,8.5,9,9.5,10,10.5,11,11.5,12 and 12.5 cM). The number of QTL-QTL pairs with additive-by-additive epistatic effects that affected the trait was assumed to be one, two or five. The QTLs with additive-by-additive epistatic effects were located in one LG or distributed over the whole genome (each QTL was in a different LG). The effects of particular pairs of genes were assumed to be equal for all pairs or one QTL-QTL pair effect was much larger than the other (for two pairs: 4 and 1; for five pairs: 2, 1, 1, 0.5 and 0.5). The variance for lines was assumed to include four variants: V1 -equal for all lines (10), V2greater for extreme lines (10) than for other lines (5), V3smaller for extreme lines (5) than for other lines (10) and V4 -different for minimal lines (10), maximal lines (20) and other lines (30). The error variance was equal to five for all generated observations. For each combination of parameters, 5000 random generations were obtained for the data set containing the vector of phenotypic observations and vectors of marker genotype observations. For each data set the line variances were estimated and the total additive-byadditive interaction effect was estimated by the phenotypic method (aa f ) and by unweighted (aa ĝ 0 ) and weighted ( , , )^âa aa aa g g g 1 2 3 versions of the genotypic method. The results were summarized as mean values for a series of simulations and mean squared errors.

Results
Different experimental situations were examined in this large-scale simulation. Tables 1, 2, 3 and 4 show the results of simulations done to compare the estimates of addi- 804 Weighted regression and epistasis tive-by-additive epistatic interactions obtained by the five methods: the phenotypic method and four genotypic methods (one unweighted and three forms of weighted) in which the distances between markers were the same (10 cM) and the number of linkage groups was five.
The results of this simulation showed that in all the four variants of variance for lines [equal for all lines (Table 1), greater for extreme lines than for other lines (Table 2), smaller for extreme lines than for other lines (Table 3) and different for minimal lines, maximal lines and other lines (Table 4)] the mean phenotypic estimate was always greater than the genotypic estimates (regardless of the method used). Genotypic estimates based on weighted regression were always greater than the unweighted estimates. The mean values for aa ĝ 2 and aa ĝ 3 were always slightly larger than aa ĝ 0 , whereas the mean values for aa ĝ 1 were much larger than aa ĝ 0 and closest (among aa ĝ ) to aa f . The relationship aa f > aa ĝ 1 > aa ĝ 2 > aa ĝ 3 > aa ĝ 0 was generally observed, except when five QTL-QTL epistatic pairs were assumed to be present in many linkage groups, in which case aa ĝ 1 > aa f . The weighted estimates were always closer to the phenotypic estimates than the unweighted estimates.
The last five columns of Tables 1-4 show that a decrease in the estimates was accompanied by an increase in their mean squared error. Simulations were also done for the groups containing seven and ten LG, and for unequal distances between markers. The results of these analyses (not shown) were similar to those described above.

Discussion
Most plant traits are quantitative in nature and are influenced by many genes of quantitative trait loci (QTL). Quantitative traits are also influenced by the environment and tend to show varying degrees of genotype´environment interaction. Epistasis, or interaction between nonallelic genes, is an important factor that affects the phenotypic expression of genes and the genetic variation in populations (Li et al., 1997). An increasing body of knowledge on biological pathways and gene networks implies that gene-gene interactions (epistasis) are important and several reports have recently argued that much genetic variance in populations is due to such interactions (Carlborg and Haley, 2004;Marchini et al., 2005;Nadeem and Azhar, 2005;Bhatti et al., 2006a,b;Evans et al., 2006;Lin et al., 2008;Ullah et al., 2010;Shakoor et al., 2010). Mapping and estimating epistatic quantitative trait loci is a difficult and serious challenge and most of the existing methods for the detection of epistasis use a higher dimensional search (Chase et al., 1997;Holland, 1998;Kao et al., 1999;Zeng et al., 1999;Carlborg et al., 2000;Sen and Churchill, 2001).   The present report is the first to compare a phenotypic estimator of additive-by-additive epistatic interaction effect with genotypic estimators obtained using weighted multiple linear regression based on a simulation study. The combination of parameters adopted in the Monte Carlo simulation study corresponded to cases often encountered in real studies.
Quantitative geneticists have long recognized the importance of gene-by-gene interactions and these have been documented for numerous crops and for various traits. Information on epistasis related to additive effects will be useful in traditional breeding. In breeding programs the use of a superior genotype in various environments when this superiority has been predicted based on QTL information obtained only in one environment (Lin et al., 2008). Epistasis is a challenge to plant breeders and can reduce the progress of quantitative traits from selection. Hence, in genetic and breeding studies, it is very important to consider the non-allelic interaction effect as well as the precision of their estimation. This paper describes the use of weighted multiple linear regression to estimate QTL-by-QTL epistatic effects, with three forms of weighting being proposed. The effectiveness of the proposed methods was validated by a series of simulations. The results confirmed the universality and stability of additive-by-additive epistatic interaction effects assessed by the weighted regression method. Similar relationships between phenotypic and genotypic estimators were obtained, irrespective of the number of linkage groups (number of chromosomes), distance between markers, the number and location of QTL-QTL epistatic pairs and the different variants of line variances.

Conclusion
Multiple linear weighted regression is useful for estimating additive-by-additive epistatic interactions. Irrespective of the variants of line variances, similar relationships were obtained between aa f , aa ĝ 0 , aa ĝ 1 , aa ĝ 2 and aa ĝ 3 . This finding indicates that these methods of estimation may be applied to different experimental situations. The ordinary weighted multiple linear regression method, with weights based on an estimated variance, is the preferred method because it provides results closer to the true values. Overall, the results described here indicate that the estimation of epistatic interaction effects by the weighted regression method may be applied to different plant species.