Bayesian estimation of genotypic and phenotypic correlations from crop variety trials

Genotypic and phenotypic correlations are necessary for constructing indirect selection indices. Bayesian analysis was therefore applied to obtain posterior distributions of the correlations, and the estimates were compared with those from a frequentist approach. Three a priori distributions for the standard deviation components, based on the uniform distribution, positive values from the t-distribution, and positive values from the normal distribution, were examined, while the a priori distribution for the correlations was taken as uniform. The prior based on the uniform distribution was found to be best according to the deviance information criterion. Data from sorghum genotypes evaluated in complete blocks in 2010-2011 in Northern Kordofan, Sudan, resulted in a posterior mean of 0.48 for the genotypic correlation between grain yield and seed weight, with a posterior standard deviation of 0.24. Because it offers a wider inference base and makes use of prior information, we recommend the Bayesian approach for estimation of genotypic correlations.


INTRODUCTION
Genotypic and phenotypic correlations between plant traits are used as measures of their association (Ahmad et al. 2010). Estimates of genotypic and phenotypic correlations between traits are useful in planning and evaluating breeding value (Desalegn et al. 2009). Knowledge of genotypic and phenotypic association among economically valuable traits can help plant breeders in identifying efficient breeding strategies for development of high-yielding wheat cultivars (Abbasi et al. 2014). Though estimation of genotypic and phenotypic correlations is straightforward, evaluation of their precision in terms of standard errors and significance testing is quite cumbersome (Singh et al. 1997).
Over the course of experimentation, crop improvement programs gather information on genotypic and experimental error variability, which can be used in the Bayesian approach. In the Bayesian framework, one integrates prior information with the likelihood of current data and draws inferences in terms of the conditional distribution of the parameters of interest, given the data. In this process, an estimate of a parameter is assessed as the posterior mean and its precision as the posterior standard deviation (Gelman et al. 2004). In contrast, the commonly used frequentist approach does not make use of such information. Singh et al. (2015) have presented a systematic approach for Bayesian analysis of trials conducted in complete or incomplete block designs. The priors discussed in their work have been incorporated in this study. This paper focuses on the Bayesian approach for estimation of genotypic and phenotypic correlations from crop variety trials and compares the estimates with those from a frequentist approach.
The frequentist approach is normally based on estimation of variance components using a mixed model. The MIXED procedure in SAS software (SAS Institute 2011) provides REML estimates of variance and covariance components among model factors and allows both fixed and random effects to be fitted in a mixed model analysis (Littell et al. 1998). Plant breeders have traditionally estimated genotypic and phenotypic correlations between traits using a multivariate analysis of variance (MANOVA) or a REML method (Hussain et al. 2012). From the Bayesian perspective on genotypic and phenotypic correlations, posterior inference can be drawn using Markov chain Monte Carlo (MCMC) methods (Tierney 1994). Schisterman et al. (2003) investigated estimation of the correlation coefficient using the Bayesian approach and its applications in epidemiological research and found it useful for evaluating relationships between variables with measurement errors. More details on Bayesian estimation of correlation may be found in Liechty et al. (2004) for models providing a framework for representing and learning about dependence structures. The objective of this study is to estimate genotypic and phenotypic correlations and their standard errors using Bayesian and frequentist approaches when data on traits have been collected from a crop variety trial conducted in a randomized complete block design. The necessary computing codes are also provided using R2WinBUGS and R packages.

Experimental data
A set of 18 sorghum genotypes was evaluated in a randomized complete block design (RCBD) with four replications. The experiment was carried out in the 2010-2011 season at El Obeid Research Station, Agricultural Research Corporation (ARC), Northern Kordofan, Sudan. Plot-wise data on grain yield in kg ha$^{-1}$ (GY) and 1000 seed weight in g (SW) were recorded.

Frequentist approach
In this approach, we consider estimation of the genotypic correlation from randomized complete block design (RCBD) data on two traits, X (for example, yield) and Y (for example, seed weight). Let $\rho_{gxy}$ denote the genotypic correlation between traits X and Y in a population of inbred lines. We assume that v inbred lines are randomly selected from the population of interest and are evaluated in an RCBD with r replications in a single environment. The responses $X_{ij}$ and $Y_{ij}$ from the plot of the i-th genotype in the j-th replicate are modeled as:

$$X_{ij} = \mu_x + \beta_{jx} + g_{ix} + \varepsilon_{ijx}, \qquad Y_{ij} = \mu_y + \beta_{jy} + g_{iy} + \varepsilon_{ijy} \tag{1}$$

where, for the two traits X and Y, $\mu_x$ and $\mu_y$ are general means, $\beta_{jx}$ and $\beta_{jy}$ are effects of the j-th block, $g_{ix}$ and $g_{iy}$ are effects of the i-th genotype sampled, and $\varepsilon_{ijx}$ and $\varepsilon_{ijy}$ are random errors, respectively (Singh and Hinkelmann 1992).
The parameter vector $(\mu_x, \mu_y)$ is assumed to be fixed.
Given the above background, the genotypic correlation between traits X and Y is estimated as:

$$\hat{\rho}_{gxy} = \frac{\hat{\sigma}_{gxy}}{\hat{\sigma}_{gx}\,\hat{\sigma}_{gy}} \tag{2}$$

where $\hat{\sigma}_{gxy}$ is the estimated genotypic covariance between traits X and Y, $\hat{\sigma}_{gx}$ is the estimated genotypic standard deviation for trait X, and $\hat{\sigma}_{gy}$ is the estimated genotypic standard deviation for trait Y. Thus, the estimate of $\rho_{gxy}$ is obtained in terms of the estimates of the variance and covariance components $\sigma^2_{gx}$, $\sigma^2_{gy}$ and $\sigma_{gxy}$. The variance components $\sigma^2_{gx}$ and $\sigma^2_{gy}$ can be estimated by using the residual (otherwise known as "restricted") maximum likelihood (REML) method (Patterson and Thompson 1971, Singh et al. 1997). To obtain the covariance $\sigma_{gxy}$, we construct a new variable Z with the plot-wise values

$$Z_{ij} = X_{ij} + Y_{ij} \tag{3}$$

so that, under model (1), the genotypic effect of Z is $g_{ix} + g_{iy}$ and its random error is $\varepsilon_{ijx} + \varepsilon_{ijy}$. The genotypic variability of variable Z, denoted by $\sigma^2_{gz}$, is expressed as:

$$\sigma^2_{gz} = \sigma^2_{gx} + \sigma^2_{gy} + 2\sigma_{gxy} \tag{4}$$

Crop Breeding and Applied Biotechnology 16: 14-21, 2016 SO Omer et al.
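The variance components for X, Y, and Z = X + Y each come from a univariate analysis of one plot-wise variable. The sketch below illustrates that step for a balanced RCBD. It uses ANOVA (method-of-moments) estimators rather than REML, which is a substitution made here for brevity; for balanced data the two typically agree when the ANOVA estimates are non-negative. The function name and the toy data are hypothetical and are not taken from the trial.

```python
def rcbd_variance_components(plots):
    """ANOVA (method-of-moments) estimates for a balanced RCBD.

    plots[i][j] is the plot value of genotype i in block j.
    Returns (genotypic variance, error variance).
    Note: this is an assumed stand-in for REML, not the paper's code.
    """
    v, r = len(plots), len(plots[0])
    grand = sum(map(sum, plots)) / (v * r)
    geno_means = [sum(row) / r for row in plots]
    block_means = [sum(row[j] for row in plots) / v for j in range(r)]
    # Mean square for genotypes, with v - 1 degrees of freedom
    ms_geno = r * sum((m - grand) ** 2 for m in geno_means) / (v - 1)
    # Residual sum of squares after removing genotype and block effects
    ss_error = sum(
        (plots[i][j] - geno_means[i] - block_means[j] + grand) ** 2
        for i in range(v)
        for j in range(r)
    )
    ms_error = ss_error / ((v - 1) * (r - 1))
    # E[ms_geno] = error variance + r * genotypic variance
    return (ms_geno - ms_error) / r, ms_error


# Toy data (hypothetical): 3 genotypes x 2 blocks for traits X and Y
x = [[10.0, 12.0], [14.0, 15.0], [20.0, 23.0]]
y = [[1.0, 2.0], [2.0, 2.0], [4.0, 5.0]]
z = [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]
components = {name: rcbd_variance_components(d)
              for name, d in (("x", x), ("y", y), ("z", z))}
```

The same function is applied unchanged to X, Y, and Z, which is what makes the univariate route convenient.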
Thus, the covariance component $\sigma_{gxy}$ can be written in terms of variance components as

$$\sigma_{gxy} = \tfrac{1}{2}\left(\sigma^2_{gz} - \sigma^2_{gx} - \sigma^2_{gy}\right) \tag{5}$$

We now apply the REML method to the values $Z_{ij}$ of Z to obtain an estimate $\hat{\sigma}^2_{gz}$ of $\sigma^2_{gz}$. Substituting the estimates of the three variance components in (5), we get an estimate $\hat{\sigma}_{gxy}$, where

$$\hat{\sigma}_{gxy} = \tfrac{1}{2}\left(\hat{\sigma}^2_{gz} - \hat{\sigma}^2_{gx} - \hat{\sigma}^2_{gy}\right)$$

In order to compute the phenotypic correlation, we consider the additive model for the phenotypic value: phenotypic value = genotypic value + environmental effect. After ignoring the variation in controlled factors, if any, we can write the phenotypic variances and covariance as follows:

$$\sigma^2_{px} = \sigma^2_{gx} + \sigma^2_{ex}, \qquad \sigma^2_{py} = \sigma^2_{gy} + \sigma^2_{ey}, \qquad \sigma_{pxy} = \sigma_{gxy} + \sigma_{exy}$$

Using equation (3), the covariance $\sigma_{exy}$ can be obtained from the variance components $\sigma^2_{ex}$, $\sigma^2_{ey}$ and $\sigma^2_{ez}$, where z = x + y, using

$$\sigma_{exy} = \tfrac{1}{2}\left(\sigma^2_{ez} - \sigma^2_{ex} - \sigma^2_{ey}\right)$$

Thus, the phenotypic correlation $\rho_{pxy}$ and the environmental correlation $\rho_{exy}$ between the traits X and Y are expressed as:

$$\rho_{pxy} = \frac{\sigma_{gxy} + \sigma_{exy}}{\sqrt{(\sigma^2_{gx} + \sigma^2_{ex})(\sigma^2_{gy} + \sigma^2_{ey})}}, \qquad \rho_{exy} = \frac{\sigma_{exy}}{\sigma_{ex}\,\sigma_{ey}}$$

Standard errors of the estimates of the phenotypic and environmental correlations can be obtained following Singh et al. (1997) with the delta method. Similar approaches have been described by Miller et al. (1958) using the corresponding variance and covariance components (Fikreselassie et al. 2012).

The approach presented here is based on a univariate approach to variables X, Y, and Z = X + Y. An alternative approach is to use a multivariate formulation implemented in several software programs. In our experience, multivariate approaches more often resulted in non-convergence than the univariate approach (e.g., REML method) did. The variance components for X and Y were also used to estimate the broad-sense heritability of the traits on a mean basis, using the expression $h^2_x = \sigma^2_{gx}/(\sigma^2_{gx} + \sigma^2_{ex}/r)$ for trait X (and similarly for trait Y), where r is the number of replications; see also Singh et al. (2015). The estimation under the frequentist approach was carried out using Genstat software (Payne 2014).
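Once the variance components for X, Y, and Z are available, the covariances, the three correlations, and the mean-basis heritability follow by simple arithmetic. The sketch below collects those steps in Python. The function names are hypothetical, and the numbers in the usage lines are illustrative values, not estimates from the sorghum trial.

```python
from math import sqrt


def covariance_from_sum(var_x, var_y, var_z):
    """Covariance of X and Y recovered from Var(Z), where Z = X + Y."""
    return 0.5 * (var_z - var_x - var_y)


def trait_correlations(var_gx, var_gy, cov_gxy, var_ex, var_ey, cov_exy):
    """Genotypic, phenotypic, and environmental correlations."""
    # Phenotypic components are sums of genotypic and environmental ones
    var_px = var_gx + var_ex
    var_py = var_gy + var_ey
    cov_pxy = cov_gxy + cov_exy
    return {
        "genotypic": cov_gxy / sqrt(var_gx * var_gy),
        "phenotypic": cov_pxy / sqrt(var_px * var_py),
        "environmental": cov_exy / sqrt(var_ex * var_ey),
    }


def heritability_mean_basis(var_g, var_e, r):
    """Broad-sense heritability on a mean basis with r replications."""
    return var_g / (var_g + var_e / r)


# Illustrative (made-up) variance components for X, Y, and Z = X + Y
cov_g = covariance_from_sum(var_x=4.0, var_y=9.0, var_z=19.0)
cov_e = covariance_from_sum(var_x=1.0, var_y=4.0, var_z=7.0)
rho = trait_correlations(4.0, 9.0, cov_g, 1.0, 4.0, cov_e)
h2_x = heritability_mean_basis(var_g=4.0, var_e=1.0, r=4)
```

The sketch assumes all variance estimates are positive; a zero or negative estimate would make the corresponding correlation undefined.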

Bayesian approach
Knowledge of the a priori probability distribution of the parameters of interest is required for making estimates under the Bayesian paradigm (Kizilkaya et al. 2002). To introduce the subject, consider the Bayesian approach for estimation of a single parameter $\theta$ using an observed data vector $y = (y_1, \ldots, y_n)$. One introduces a degree of belief in the parameter $\theta$ in terms of its probability distribution function, for example $g(\theta)$, called the a priori distribution of $\theta$, or simply a prior for $\theta$. The inference about $\theta$ is obtained in terms of the probability distribution of $\theta$ given the data y, expressed as $p(\theta \mid y) \propto g(\theta)\, f(y \mid \theta)$ and called the a posteriori, or simply posterior, density function of $\theta$, which follows from Bayes' theorem as given in standard texts (Ntzoufras 2002, Rowe 2003, Gelman et al. 2004, Robert and Casella 2004). Using this a posteriori density, one can obtain the expected value of $\theta$ as an estimate of $\theta$, its standard error, and its Bayesian credible intervals.

The posterior distributions for each of $\rho_{\beta xy}$, $\rho_{gxy}$, and $\rho_{exy}$ can be obtained using the following expression for the general case of s parameters $\theta_1, \theta_2, \ldots, \theta_s$. Let us denote the vector $\theta = (\theta_1, \theta_2, \ldots, \theta_s)$. Furthermore, let the bivariate data (x, y) be generated on a pair of variables (X, Y) from the probability density function denoted by $f(x, y \mid \theta)$. The a posteriori distribution of $\theta_k$ (k = 1, 2, ..., s) based on an assumed joint a priori distribution $g(\theta)$ of $\theta$ is given by:

$$p(\theta_k \mid (x, y)) \propto \int \cdots \int g(\theta)\, f(x, y \mid \theta)\; d\theta_1\, d\theta_2 \cdots d\theta_{k-1}\, d\theta_{k+1} \cdots d\theta_s$$

The priors used include uniform, half normal, and gamma distributions for genotypic and phenotypic standard deviation components and uniform distribution for the correlations. Wong et al. (2003) proposed a prior probability model for the precision matrix in the case of multivariate responses. For responses from an RCBD, mixed linear models were used to estimate the variance components (Vargas et al. 2013). In the present context, the parameters of model (1) are $\mu_x$, $\mu_y$, $\beta_{jx}$, $\beta_{jy}$, $g_{ix}$, $g_{iy}$ (the effects), $\sigma_{\beta x}$, $\sigma_{\beta y}$, $\sigma_{gx}$, $\sigma_{gy}$, $\sigma_{ex}$, $\sigma_{ey}$ (the standard deviations), and $\rho_{\beta xy}$, $\rho_{gxy}$ and $\rho_{exy}$ (the correlations). Priors are needed for the standard deviations and correlations in the above. Following Gelman (2006), we used non-informative priors for the scale parameters underlying these correlations, taken from the uniform, positive half-t, and positive half-normal families of distributions (Crossa et al. 2010). Three sets of prior distributions, denoted $P_1$, $P_2$, and $P_3$, were considered.
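The proportionality between posterior, prior, and likelihood described above can be made concrete with a small grid approximation for a single correlation parameter. The sketch below is only an illustration of the principle and is not the MCMC scheme used in this study. It assumes, hypothetically, that both variables have known zero means and unit variances, and it places a Uniform(-1, 1) prior on the correlation; the function name and the data pairs are made up.

```python
from math import exp, log, sqrt


def grid_posterior_correlation(pairs, n_grid=399):
    """Grid approximation to p(rho | data), proportional to g(rho) f(data | rho).

    Assumptions (illustrative only): bivariate normal data with zero means
    and unit variances, and a Uniform(-1, 1) prior g(rho).
    Returns (grid, normalized weights, posterior mean, posterior SD).
    """
    n = len(pairs)
    sxx = sum(x * x for x, _ in pairs)
    syy = sum(y * y for _, y in pairs)
    sxy = sum(x * y for x, y in pairs)
    # Interior grid points strictly inside (-1, 1)
    grid = [-1 + 2 * (k + 1) / (n_grid + 1) for k in range(n_grid)]
    # Log-likelihood up to a constant; the flat prior adds only a constant
    log_post = [
        -0.5 * n * log(1 - rho * rho)
        - (sxx - 2 * rho * sxy + syy) / (2 * (1 - rho * rho))
        for rho in grid
    ]
    top = max(log_post)  # subtract the maximum for numerical stability
    weights = [exp(lp - top) for lp in log_post]
    total = sum(weights)
    weights = [w / total for w in weights]
    mean = sum(r * w for r, w in zip(grid, weights))
    sd = sqrt(sum((r - mean) ** 2 * w for r, w in zip(grid, weights)))
    return grid, weights, mean, sd


# Made-up standardized observations on two variables
pairs = [(-1.2, -0.9), (-0.4, -0.6), (0.1, 0.3), (0.8, 0.5), (1.5, 1.1)]
grid, weights, post_mean, post_sd = grid_posterior_correlation(pairs)
```

With several parameters, as in model (1), the grid becomes impractical, which is why simulation-based methods are used instead.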
Since there are multiple priors, the best prior distribution was selected using a discrepancy criterion, the deviance information criterion (DIC), commonly considered for prior model selection (Gelman et al. 2004, Griffin and Brown 2012). The inference on the correlations was drawn using the best prior. We used the R2WinBUGS package and the R codes given in the Appendices. The number of iterations was set at 100,000 with three chains, and 5000 simulated values were taken for statistical summaries on the posteriors. Unlike the univariate approach in the frequentist method, here we used a multivariate (bivariate) framework in the Bayesian computations. In the bivariate case, the calculations were carried out by defining the priors at each element of the variance-covariance matrix. Alternatively, particularly with more than two traits, one may use a Wishart distribution.
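Defining priors at each element of the bivariate variance-covariance matrix amounts to drawing two standard deviations and one correlation and then assembling the matrix. The Python sketch below illustrates that construction for the three prior families named in the text. It is not the WinBUGS code from the Appendices, and the hyperparameters (`upper`, `scale`, `df`) are hypothetical placeholders rather than the values used in the study.

```python
import random
from math import sqrt


def draw_sd(prior, rng, upper=100.0, scale=10.0, df=3):
    """Draw one standard deviation from a named prior family.

    upper, scale, and df are assumed illustrative hyperparameters.
    """
    if prior == "uniform":
        return rng.uniform(0.0, upper)
    if prior == "half-normal":
        return abs(rng.gauss(0.0, scale))
    if prior == "half-t":
        # Student t draw: normal divided by sqrt(chi-square / df)
        chi2 = rng.gammavariate(df / 2.0, 2.0)
        return abs(scale * rng.gauss(0.0, 1.0) / sqrt(chi2 / df))
    raise ValueError(f"unknown prior family: {prior}")


def covariance_matrix(sd_x, sd_y, rho):
    """Assemble the 2 x 2 variance-covariance matrix element by element."""
    cov = rho * sd_x * sd_y
    return [[sd_x ** 2, cov], [cov, sd_y ** 2]]


def draw_prior_covariance(prior, rng):
    """One prior draw of the matrix, with a Uniform(-1, 1) correlation."""
    rho = rng.uniform(-1.0, 1.0)
    return covariance_matrix(draw_sd(prior, rng), draw_sd(prior, rng), rho)


rng = random.Random(2016)  # fixed seed so the draws are repeatable
example = {p: draw_prior_covariance(p, rng)
           for p in ("uniform", "half-normal", "half-t")}
```

Because the matrix is built from standard deviations and a correlation in (-1, 1), it is positive semi-definite by construction, which is harder to guarantee element by element once there are more than two traits.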

Selection of priors
Choices of priors for Bayesian analysis were made from the statistics given in Table 1. Deviance information criterion (DIC) values were 1158.02 for $P_1$, 1168.11 for $P_2$, and 1631.9 for $P_3$. Because the prior set $P_1$ had the lowest DIC value (1158.02), we used $P_1$ for estimation of the genetic parameters.
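For reference, the DIC values compared above follow the usual definition of Spiegelhalter et al. (2002), restated here in generic notation that is not taken from the Appendices. With $\theta$ the model parameters, $\bar{\theta}$ their posterior mean, and $f(y \mid \theta)$ the likelihood:

```latex
\begin{aligned}
D(\theta) &= -2 \log f(y \mid \theta), \\
\bar{D} &= \mathrm{E}_{\theta \mid y}\left[ D(\theta) \right], \\
p_D &= \bar{D} - D(\bar{\theta}), \\
\mathrm{DIC} &= \bar{D} + p_D = D(\bar{\theta}) + 2\,p_D .
\end{aligned}
```

Smaller values indicate better support from the data. On this scale, $P_1$ is lower than $P_2$ by 10.09 units and lower than $P_3$ by 473.88 units.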

Genotypic and phenotypic variance components and heritability
Table 2 shows the frequentist estimates of the genotypic, phenotypic, and environmental variances, their estimated standard errors as described in Singh and El-Bizri (1992), and the asymptotic 95% confidence intervals. Bayesian estimates are based on the best prior set ($P_1$) selected using the DIC. The posterior means of the genotypic and environmental variance components were higher than the associated estimates in the frequentist version. Estimates of broad-sense heritability on a mean basis followed a similar trend, with Bayesian vs. frequentist estimates of 0.94 vs. 0.95 for GY and 0.67 vs. 0.70 for SW.

Genotypic, phenotypic, and environmental correlations
Table 3 presents, for the frequentist approach, estimates, estimated standard errors, and asymptotic confidence intervals of the genotypic, phenotypic, and environmental correlations between GY and SW and, for the Bayesian approach, their posterior means, standard deviations, and medians, along with credible intervals. Genotypic, phenotypic, and environmental correlations between GY and SW under the frequentist vs. Bayesian approach were 0.547 vs. 0.475, 0.377 vs. 0.328, and 0.226 vs. 0.216, respectively. A comparison between means and medians showed that the Bayesian posterior distributions of these correlations are slightly skewed. The precision levels of the various correlations were reasonably close for the two approaches. Sorghum genotypes considered in the trial showed significant genetic variability for grain yield (GY) and 1000 seed weight (SW).

The study makes use of prior information in terms of distributions of various variance components that may be made available from an ongoing series of crop variety trials. The Bayesian approach shows how this information can be utilized: it integrates the prior information with the likelihood of the current datasets so as to draw inferences on genotypic, phenotypic, and environmental correlations. Variable degrees of difference between the Bayesian and frequentist approaches have been found in the precision levels of the estimates of variance-component-based parameters in other studies (Singh et al. 2015). In the case of the Bayesian approach, the precision associated with a parameter depends on the priors used. The merit of the Bayesian approach rests on the premise that the priors allow a realistic coverage of the distributions of the various parameters. The Bayesian approach may not necessarily result in a lower posterior standard deviation of a parameter in comparison with the standard error of the estimate of the parameter in the frequentist approach.
Such investigations need to be carried out on other datasets to assess trends in the precision obtained by these two approaches. The most commonly used priors for variance components in terms of the standard deviation components have been used (Gelman 2006), but classes of other relevant priors (Crossa et al. 2010) may also be included to examine support from the data using the deviance information criterion. The simulation in the Bayesian approach using the R2WinBUGS software (Spiegelhalter et al. 2002) enables evaluation of the posterior distribution of the derived correlations in terms of variance and covariance components, unlike the frequentist methods, where the distribution is commonly simplified through an asymptotic approximation (Singh and El-Bizri 1992). The summaries produced through R2WinBUGS, in terms of posterior mean and median, allow inferences regarding the symmetry of the distributions, and the percentiles are used in reporting the credible intervals. Bayesian computation can also use information from additional experimental units that have data on only a single trait (broken samples) to estimate the genotypic and phenotypic correlations and the variance components for those traits. Furthermore, the study of Bayesian estimation should be extended to multivariate cases (with more than two traits) in future investigations in plant breeding. Likewise, heterogeneity in environmental variances and in genotypic variances should be addressed in a future study by considering suitable models for heterogeneity of variances.
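The posterior summaries referred to above (mean, standard deviation, median, and percentile-based credible interval) can be computed directly from the simulated values. The sketch below does this in Python with the standard library; it is not the R2WinBUGS output routine, and `draws` stands for any list of posterior draws of a correlation. The example values are synthetic.

```python
import statistics


def summarize_posterior(draws):
    """Posterior mean, SD, median, and 95% equal-tailed credible interval."""
    # n=40 gives 39 cut points spaced every 2.5 percent
    cuts = statistics.quantiles(draws, n=40)
    return {
        "mean": statistics.fmean(draws),
        "sd": statistics.stdev(draws),
        "median": statistics.median(draws),
        "ci95": (cuts[0], cuts[-1]),  # 2.5th and 97.5th percentiles
    }


# Synthetic draws (not from the study): a symmetric and a skewed sample
symmetric = [i / 100 for i in range(1, 100)]
skewed = [d ** 2 for d in symmetric]
summary_symmetric = summarize_posterior(symmetric)
summary_skewed = summarize_posterior(skewed)
```

A gap between the mean and the median, as in the skewed sample here, is the kind of signal used in the text to judge the symmetry of a posterior distribution.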
In summary, this study presents the Bayesian approach for estimation of genotypic and phenotypic correlations between traits from crop variety trials using priors on standard deviation components and correlations obtainable from a series of previously conducted trials. The R2WinBUGS software was used to obtain Bayesian estimates of genotypic and phenotypic correlations from experimental design data. The prior set based on the uniform distribution was found to be best, and it led to precision similar to that of the frequentist approach. Due to its sound inference base, the Bayesian approach with WinBUGS and R codes is recommended for use in estimation of genotypic correlations in plant breeding trials.

Table 2. Estimates of variance components and broad-sense heritability on a mean basis for grain yield and 1000 seed weight under the frequentist and Bayesian approaches for the 2010-11 dataset. SE: standard error. SD: standard deviation. The SE and 95% confidence intervals for the frequentist approach estimates are based on asymptotic normal approximation.

Table 3. Estimates of genotypic ($\rho_g$), phenotypic ($\rho_p$), and environmental ($\rho_e$) correlations between grain yield (GY) and 1000 seed weight (SW) under frequentist and Bayesian approaches for the 2010-11 dataset. SE: standard error. SD: standard deviation. The SE and 95% confidence intervals for the frequentist approach estimates are based on asymptotic normal approximation.