Determination of optimal number of independent components in yield traits in rice

ABSTRACT:

Principal component regression (PCR) and independent component regression (ICR) are dimensionality reduction methods of great importance in genomic prediction. Both require choosing the number of components to be inserted into the model. For PCR, there are formal criteria; for ICR, however, the criterion adopted so far chooses the number of independent components (ICs) associated with the greatest accuracy and requires high computational time. In this study, seven criteria based on the number of principal components (PCs) and on variable selection methods are proposed to guide this choice in ICR and are evaluated in simulated and real data. For both datasets, the most efficient criterion, which also drastically reduced computational time, set the number of ICs equal to the number of PCs that leads to the highest accuracy. In addition, the criteria did not recover the simulated heritability and generated biased genomic values.

Keywords:
Oryza sativa L.; genomic prediction; plant breeding; principal component regression; independent component regression

Introduction

The prediction process in Genome Wide Selection (GWS) (Meuwissen et al., 2001) presents statistical problems related to high dimensionality (number of markers greater than the number of individual phenotypic observations) and multicollinearity (highly correlated markers), which affect the accuracy of methods based on ordinary least squares (OLS) (Desta and Ortiz, 2014).

In this context, methodologies to solve such statistical challenges have gained prominence in GWS research. Resende et al. (2012) reported that the statistical methodologies applied to GWS can be divided into three groups: explicit regression methods, implicit regression methods, and dimensionality reduction methods. Among the dimensionality reduction methods, Principal Component Regression (PCR) and Independent Component Regression (ICR) stand out from the other methods applied to GWS because of their wide applicability and relatively simple theory.

PCR and ICR require the choice of the optimal number of components, which are linear combinations of the markers, to be inserted into the prediction equation. The statistical theory of PCR demonstrates that the first components represent most of the total data variability. Le Floch et al. (2012) presented a criterion for choosing the optimal number of components based on this property.

In genomic selection, effective methodologies for the prediction process are desirable, and accuracy is one of the main measures of efficacy. Azevedo et al. (2014, 2015) chose the number of independent components (ICs) associated with the greatest accuracy; however, running the analyses required a high computational effort, which often becomes impractical.

In this study, we aimed to propose and evaluate, using simulated genomic data, seven decision criteria for the optimal number of components to be inserted into the model. We also evaluated the seven criteria with real data in the genomic prediction of six rice yield traits, to show the importance of the procedures described here for breeding programs and the importance of genomic prediction for the Asian rice Oryza sativa L. (Grenier et al., 2015; Hassen et al., 2018; Spindel et al., 2015; Spindel et al., 2016).

Materials and Methods

The simulated dataset was generated as described by Azevedo et al. (2015, 2017). We simulated 2,000 equidistant single nucleotide polymorphism (SNP) markers separated by 0.1 centiMorgan across ten chromosomes. The quantitative trait loci (QTLs) were randomly distributed in the regions covered by the SNPs. A total of 1,000 individuals from 20 full-sib families were genotyped and phenotyped. The simulations assumed absence of dominance, and four scenarios were used in the analyses: two narrow-sense heritability levels (about 0.20 and 0.30) × two genetic architectures (polygenic and mixed inheritance). The scenarios were analyzed with the dimensionality reduction methods, ICR and PCR, and with the criteria for choosing the number of components. Each scenario was simulated ten times.

The real dataset corresponds to Asian rice and consists of six yield traits recorded on 370 rice accessions, which were genotyped for 44,100 SNP markers. This dataset is freely available and is part of two projects, the OryzaSNP Project and the OMAP Project (Ammiraju et al., 2006; Zhao et al., 2011); it can be obtained at https://ricediversity.org/data/. The six rice yield traits used in this study were: (i) panicle number per plant, (ii) plant height, (iii) panicle length, (iv) primary panicle branch number, (v) seed number per panicle, and (vi) florets per panicle.

The linear model is given by:

$$y = \mathbf{1}\mu + X m_a + e, \tag{1}$$

where: y is the vector of phenotypic observations with dimension I × 1, where I is the number of individuals genotyped and phenotyped; $\mathbf{1}$ is a vector of ones; μ is the overall mean of the trait; and $m_a$ is the vector of additive marker effects, with incidence matrix X of dimension I × J composed of the values 0, 1, and 2, where J is the number of markers. The vector of random errors e has the variance structure $e \sim N(0, \mathbf{I}\sigma_e^2)$, where $\mathbf{I}$ is the identity matrix and $\sigma_e^2$ is the residual variance.

Principal Component Regression (PCR) and Independent Component Regression (ICR) can be used in any situation with high-dimensionality problems. The main difference between the methods is that, in PCR, the principal components $Z_m$ ($m = 1, \ldots, n_{PCR}$) are orthogonal and the first components explain much of the total variability. In ICR, the components are independent, that is, there is no functional relationship between them, and each one explains a small part of the total data variability, in differing proportions.

In Principal Component Regression, the PCs are defined as:

$$Z = X P, \tag{2}$$

where: X is the incidence matrix of the markers and P is the matrix of eigenvectors of the covariance matrix of X. The first component is associated with the largest eigenvalue of the covariance matrix, and the percentage of variation explained by the jth component is given by:

$$\frac{\lambda_j}{\sum_{j=1}^{m} \sigma_j^2},$$

where $\lambda_j$ is the corresponding eigenvalue. To predict the genomic values, the vector of phenotypic observations (y) is regressed on the components (Z); for this regression to be possible, the number of components inserted into the model ($n_{PCR}$) must be less than or equal to min(I, J) − 1. After this choice, the first $n_{PCR}$ components, $Z_1, Z_2, \ldots, Z_{n_{PCR}}$, are selected and the fitted prediction equation is $\hat{y} = Z_1\hat{\alpha}_1 + Z_2\hat{\alpha}_2 + \cdots + Z_{n_{PCR}}\hat{\alpha}_{n_{PCR}}$, where $\hat{\alpha} = [\hat{\alpha}_1\ \hat{\alpha}_2\ \cdots\ \hat{\alpha}_{n_{PCR}}]$ is the vector of estimated regression coefficients obtained by the OLS method.

The coefficients α are not directly related to the markers. The following expression is used to obtain the estimates of the marker effects:

$$\hat{m}_{PCR} = P_{n_{PCR}} \hat{\alpha}, \tag{3}$$

where: $P_{n_{PCR}}$ is the matrix of eigenvectors associated with the components $Z_1, Z_2, \ldots, Z_{n_{PCR}}$.
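The PCR steps in equations (2) and (3) can be sketched as follows. This is a minimal illustration using NumPy, not the GenomicLand implementation; the function names `pcr_fit` and `pcr_predict`, the centring of X and y, and the simulated genotypes are assumptions made for the example.

```python
import numpy as np

def pcr_fit(X, y, n_pcr):
    """Sketch of PCR: Z = X P, OLS of y on Z, then m_hat = P_n alpha_hat."""
    x_mean = X.mean(axis=0)
    Xc = X - x_mean                          # centred marker matrix (assumption)
    cov = np.cov(Xc, rowvar=False)           # J x J covariance matrix of X
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # largest eigenvalue first
    P_n = eigvecs[:, order[:n_pcr]]          # eigenvectors of the first n_pcr PCs
    Z = Xc @ P_n                             # principal components, I x n_pcr
    y_mean = y.mean()
    alpha_hat, *_ = np.linalg.lstsq(Z, y - y_mean, rcond=None)  # OLS
    m_hat = P_n @ alpha_hat                  # marker effects, equation (3)
    return m_hat, x_mean, y_mean

def pcr_predict(X_new, m_hat, x_mean, y_mean):
    """Genomic values predicted from marker effects."""
    return y_mean + (X_new - x_mean) @ m_hat

# Hypothetical toy data: 30 individuals, 50 markers coded 0/1/2.
rng = np.random.default_rng(0)
I, J = 30, 50
X = rng.integers(0, 3, size=(I, J)).astype(float)
y = X @ rng.normal(size=J) + rng.normal(size=I)

m_hat, x_mean, y_mean = pcr_fit(X, y, n_pcr=5)
gebv = pcr_predict(X, m_hat, x_mean, y_mean)
```

With the maximum number of components, min(I, J) − 1, the fit reproduces the training phenotypes exactly, which illustrates why fewer components are normally chosen. Forming the J × J covariance matrix is only practical for small J; with tens of thousands of markers a singular value decomposition of the centred X would be the usual alternative.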

The ICR decomposes the matrix X into X = SA′, where S (I × $n_{ICR}$) is the matrix of ICs, A (J × $n_{ICR}$) is the mixing matrix, which is usually unknown, and $n_{ICR}$ is the number of ICs chosen. To estimate A, the first step is to obtain the matrix K (called the whitening matrix) from the orthogonal decomposition of the covariance matrix of X, so that the covariance matrix of XK equals the identity matrix, that is, the columns of XK are uncorrelated and have unit variance. The orthogonal decomposition of the covariance matrix of X, denoted by Σ (J × J), gives $\Sigma = P \Lambda P'$, where the columns of P are the eigenvectors and Λ is the diagonal matrix of eigenvalues of the covariance matrix of X. In regression, the matrix K (J × $n_{ICR}$) is then defined as $P_r \Lambda_r^{-1/2}$, where $P_r$ contains the first $n_{ICR}$ columns of P (the first $n_{ICR}$ eigenvectors) and $\Lambda_r$ contains the first $n_{ICR}$ rows and columns of Λ (the eigenvalues associated with these eigenvectors). To achieve independence between the components, the algorithm proposed by Hyvärinen (1998), which is based on the principle of maximum entropy, is used to obtain a new matrix denoted by R ($n_{ICR}$ × $n_{ICR}$). After the algorithm is applied, the ICs can be expressed by:

$$S = X K R. \tag{4}$$

Then, the prediction equation relating the response variable y to the ICs $S_1, S_2, \ldots, S_{n_{ICR}}$ is given by $\hat{y} = \hat{\beta}_1 S_1 + \hat{\beta}_2 S_2 + \cdots + \hat{\beta}_{n_{ICR}} S_{n_{ICR}}$, where $\hat{\beta} = [\hat{\beta}_1\ \hat{\beta}_2\ \cdots\ \hat{\beta}_{n_{ICR}}]$ is the vector of regression coefficient estimates obtained by the OLS method. As in PCR, the estimates of the marker effects are obtained from the expression:

$$\hat{m}_{ICR} = K R \hat{\beta}. \tag{5}$$
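The whitening step and the back-transformation to marker effects can be checked numerically. The sketch below uses NumPy and takes the whitening matrix as the first eigenvectors scaled by the inverse square roots of their eigenvalues, which is the form under which the covariance of XK is the identity. It does not implement the entropy-based algorithm of Hyvärinen (1998): a random orthogonal matrix is used as a placeholder for R, purely to show the algebra of S = XKR and of the marker effects KRβ. All data and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
I, J, n_icr = 40, 60, 5
X = rng.integers(0, 3, size=(I, J)).astype(float)  # hypothetical genotypes
Xc = X - X.mean(axis=0)                            # centred markers

# Orthogonal decomposition of the covariance matrix of X.
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1][:n_icr]          # first n_icr eigenvalues
P_r = eigvecs[:, order]
lambda_r = eigvals[order]

# Whitening matrix: each eigenvector scaled by 1 / sqrt(eigenvalue).
K = P_r / np.sqrt(lambda_r)                        # J x n_icr

# PLACEHOLDER rotation (assumption): a random orthogonal matrix stands in
# for the R that an ICA algorithm would estimate.
R, _ = np.linalg.qr(rng.normal(size=(n_icr, n_icr)))

S = Xc @ K @ R                                     # components, equation (4)

# OLS of a hypothetical phenotype on the components, then marker effects.
y = rng.normal(size=I)
beta_hat, *_ = np.linalg.lstsq(S, y - y.mean(), rcond=None)
m_icr = K @ R @ beta_hat                           # equation (5)
```

Because R is orthogonal, the rotated components remain uncorrelated with unit variance, and predictions computed from the marker effects agree with those computed from the components.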

The simulated and real datasets were each analyzed using two populations (an estimation population and a validation population), following a different validation procedure for each dataset. In the simulated data, the criteria were compared by means of an independent validation in which the first nine simulations were taken as estimation populations and the 10th simulation as the validation population. The real data were evaluated under a ten-fold validation process. The use of different validation processes is justified because the real dataset comprises a small number of individuals (370), which makes independent validation unviable; in such cases, James et al. (2013) suggest a ten-fold validation.
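A ten-fold split of the 370 accessions can be sketched as below. This is only an illustration of the partitioning; the helper name `k_fold_indices`, the random seed, and the shuffling scheme are assumptions, and the fitting step is left as a comment.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

n_accessions = 370
folds = k_fold_indices(n_accessions, k=10)

for validation in folds:
    held_out = set(validation)
    estimation = [i for i in range(n_accessions) if i not in held_out]
    # Fit the model on `estimation`, predict the individuals in `validation`.
```

Each accession appears in exactly one validation fold, so every individual is predicted once by a model that did not use its phenotype.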

The criteria analyzed aimed to determine the optimal number of ICs using the following procedures.

Criterion 1 (Based on predictive ability or accuracy obtained through PCR fit): For each number of PCs (m = 1, ..., min(I, J) − 1), the marker effects are estimated by PCR in the estimation population and used in the validation population to estimate the genomic breeding values of its individuals. Then, for the simulated data, we computed the accuracy (r), the correlation between the estimated genomic value and the real genomic value (r = Cor(â, a)), and, for the real data, the predictive ability (r), the correlation between the estimated genomic value and the phenotypic value (r = Cor(â, y)). This criterion sets the number of ICs equal to the number of PCs whose genomic values lead to the greatest accuracy or predictive ability. Cadavid et al. (2008) and Azevedo et al. (2013) corroborated the use of PCR in the choice of the number of ICs.

Criterion 2 (Based on bias and predictive ability or accuracy obtained through the PCR fit): The same procedure as in Criterion 1 is used, but the regression coefficient (b) of the phenotype on the estimated genomic value is calculated and, from it, the prediction bias, given by 1 − b. The number of ICs is then set equal to the number of PCs whose genomic values lead to the smallest prediction bias.
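Once accuracy and the regression coefficient have been obtained for each candidate number of PCs, Criteria 1 and 2 reduce to a simple search. The sketch below assumes those validation results are already available; the numbers in `results` are invented for illustration, and taking the smallest absolute value of 1 − b as the "smallest bias" is an assumption of this example.

```python
def choose_by_accuracy(results):
    """Criterion 1: number of PCs with the greatest accuracy r."""
    return max(results, key=lambda m: results[m]["r"])

def choose_by_bias(results):
    """Criterion 2: number of PCs with the smallest |1 - b| (assumed reading)."""
    return min(results, key=lambda m: abs(1.0 - results[m]["b"]))

# Hypothetical validation results: number of PCs -> accuracy r and slope b.
results = {
    10: {"r": 0.41, "b": 0.55},
    50: {"r": 0.58, "b": 0.80},
    100: {"r": 0.63, "b": 1.30},
    200: {"r": 0.60, "b": 0.95},
}

n_ic_criterion_1 = choose_by_accuracy(results)
n_ic_criterion_2 = choose_by_bias(results)
```

In this made-up example the two criteria disagree (100 components for Criterion 1, 200 for Criterion 2), which shows that the most accurate model is not necessarily the least biased one.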

Criterion 3 (Based on the percentage of the total variation of the markers explained by the PCs): The percentage of the total variation of X explained by the first m PCs is given by:

$$p_m(\%) = \frac{\sum_{j=1}^{m} \lambda_j}{\sum_{j=1}^{J} \lambda_j},$$

where $\lambda_j$ is the eigenvalue corresponding to the jth eigenvector of the covariance matrix of X. Criterion 3 sets the number of ICs equal to the number of PCs that explain 80 % of the total variation of X, as recommended by Ferreira (2012). The researcher can also choose another threshold, taking into account both the percentage of the data variation explained and the resulting dimensionality reduction.
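Criterion 3 can be sketched as a cumulative sum over the eigenvalues. The function name and the example eigenvalues are hypothetical, and "explain 80 %" is read here as "the smallest m whose cumulative proportion reaches at least 80 %".

```python
def n_components_for_threshold(eigenvalues, threshold=0.80):
    """Smallest m such that the first m eigenvalues reach the threshold."""
    lam = sorted(eigenvalues, reverse=True)   # largest eigenvalue first
    total = sum(lam)
    cumulative = 0.0
    for m, value in enumerate(lam, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return m
    return len(lam)

# Hypothetical eigenvalues of the covariance matrix of X.
eigenvalues = [5.0, 3.5, 1.0, 0.3, 0.2]
n_ic = n_components_for_threshold(eigenvalues)          # 80 % threshold
n_ic_95 = n_components_for_threshold(eigenvalues, 0.95)  # stricter threshold
```

Here the first two eigenvalues account for 85 % of the total, so two components are retained at the 80 % threshold and three at 95 %.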

Criterion 4 (Based on the coefficient of determination obtained after the PCR fit): Using the coefficient of determination ($R^2 = \mathrm{Cor}(y, \hat{a})^2 \times 100\,\%$), the number of ICs is set equal to the number of PCs explaining 80 % of the total variation of y.

Criterion 5 (Based on the percentage of the total variation of the markers explained by the ICs): Assuming that the ICs have means equal to 0 and variances equal to 1, the variation explained by the kth IC is given by:

$$\frac{I \sum_{j=1}^{J} a_{jk}^2}{\sum_{i=1}^{I} \sum_{j=1}^{J} x_{ij}^2},$$

where $a_{jk}$ is the element in the jth row and kth column of the mixing matrix A, and $x_{ij}$ is the element in the ith row and jth column of the centered matrix of explanatory variables X (i = 1, 2, ..., I) (Bingham and Hyvärinen, 2000; Helwig and Hong, 2013). The number of ranked ICs that explain 80 % of the total variation of X is then chosen.
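The Criterion 5 formula can be verified on a tiny example. In the sketch below, X is built from two orthogonal components with mean 0 and unit variance (computed with divisor I, an assumption that makes the shares sum exactly to 1) and a hypothetical 2 × 2 mixing matrix A; the function names are illustrative.

```python
def ic_variance_shares(A, X):
    """Share of the total variation of centred X explained by each IC."""
    n_ind = len(X)                                   # I, number of individuals
    total = sum(x * x for row in X for x in row)     # sum of all x_ij^2
    n_ic = len(A[0])
    return [n_ind * sum(row[k] ** 2 for row in A) / total
            for k in range(n_ic)]

def n_ranked_ics(shares, threshold=0.80):
    """Number of ICs, ranked by share, needed to reach the threshold."""
    cumulative = 0.0
    for count, share in enumerate(sorted(shares, reverse=True), start=1):
        cumulative += share
        if cumulative >= threshold:
            return count
    return len(shares)

# Components s1 = (1, 1, -1, -1) and s2 = (1, -1, 1, -1); X = S A'.
A = [[2.0, 0.0],
     [1.0, 1.0]]                                     # rows: markers, cols: ICs
X = [[2.0, 2.0], [2.0, 0.0], [-2.0, 0.0], [-2.0, -2.0]]

shares = ic_variance_shares(A, X)
n_ic = n_ranked_ics(shares)
```

The first component explains 20/24 of the variation and the second 4/24, so a single ranked IC is enough to pass the 80 % threshold in this example.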

Criterion 6 (Based on the IC Forward Selection algorithm): After applying ICA to matrix X, the ICs to be included in the model as predictors are determined from the min(I, J) − 1 candidate ICs. Following the Forward Selection algorithm described by James et al. (2013), the model $M_0$ without ICs is considered first. In the first iteration, the models with only one IC, denoted by $M_{1i}$ (i = 1, ..., min(I, J) − 1), are constructed and $R^2$ is calculated for each model; the model with the highest $R^2$ is defined as $M_1$. In the second iteration, the models with two ICs, denoted by $M_{2i}$, are constructed (all of them must contain the predictor of model $M_1$), and the model with the highest $R^2$ is denoted by $M_2$. This procedure is performed min(I, J) − 1 times to determine the models $M_1, M_2, \ldots, M_{\min(I,J)-1}$, with 1, ..., min(I, J) − 1 ICs, respectively. Among all these models, the one with the lowest BIC (Bayesian Information Criterion) is chosen. This criterion therefore determines which and how many ICs are used in the chosen model.
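The Forward Selection procedure can be sketched with NumPy as follows. The BIC is computed here in the common Gaussian form n·ln(RSS/n) + (k + 1)·ln(n), which is an assumption and may differ from the expression used by the authors by constants or scaling; the function names and the simulated components are hypothetical.

```python
import numpy as np

def rss(S_sub, y):
    """Residual sum of squares of an OLS fit with intercept."""
    yc = y - y.mean()
    if S_sub.shape[1] == 0:
        return float(yc @ yc)
    Sc = S_sub - S_sub.mean(axis=0)
    beta, *_ = np.linalg.lstsq(Sc, yc, rcond=None)
    resid = yc - Sc @ beta
    return float(resid @ resid)

def forward_selection_bic(S, y):
    """Build M_1, M_2, ... greedily and return the ICs of the lowest-BIC model."""
    n, n_ic = S.shape
    selected, remaining = [], list(range(n_ic))
    best_bic, best_set = np.inf, []
    while remaining:
        # Smallest RSS is equivalent to largest R^2 for a fixed response.
        scores = {c: rss(S[:, selected + [c]], y) for c in remaining}
        chosen = min(scores, key=scores.get)
        selected.append(chosen)
        remaining.remove(chosen)
        k = len(selected)
        model_bic = n * np.log(scores[chosen] / n) + (k + 1) * np.log(n)
        if model_bic < best_bic:
            best_bic, best_set = model_bic, list(selected)
    return best_set

# Hypothetical components: only ICs 1 and 4 affect the response.
rng = np.random.default_rng(2)
S = rng.normal(size=(100, 6))
y = 3.0 * S[:, 1] - 2.0 * S[:, 4] + 0.1 * rng.normal(size=100)
selected_ics = forward_selection_bic(S, y)
```

The returned list is in order of entry, so its first two elements are the two components that truly drive the simulated response. Backward Elimination follows the same pattern in reverse, starting from the full model and dropping at each step the component whose removal leaves the largest R².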

Criterion 7 (Based on the IC Backward Elimination algorithm): Following the Backward Elimination algorithm described by James et al. (2013), the complete model $M_{\min(I,J)-1}$ is considered first, that is, the model with the maximum number of ICs built after applying ICA. Next, the candidate models with min(I, J) − 2 components are constructed by removing one IC at a time, and $R^2$ is calculated for each; the model with the largest $R^2$ is denoted by $M_{\min(I,J)-2}$. The process is repeated to determine the models $M_{\min(I,J)-3}, \ldots, M_1$. A component removed from the model does not take part in the following iterations. Finally, among these min(I, J) − 1 models, the one with the lowest BIC is chosen. This criterion determines which and how many ICs are in the chosen model.

In the simulated data, the following efficacy measures of genomic prediction were calculated for each replicate: i) accuracy ($r_{\hat{a}a}$), the correlation between the genomic estimated breeding values (GEBVs, denoted by $\hat{a}$) and the simulated genetic values (a); ii) prediction bias, defined as 1 − b, where b is the regression coefficient of the phenotype (y) on the GEBVs; and iii) additive genomic heritability ($h_{aM}^2$), given by:

$$h_{aM}^2 = \frac{\sigma_{aM}^2}{\sigma_{aM}^2 + \sigma_e^2},$$

where $\sigma_{aM}^2 = \sum_{j=1}^{J} 2 p_j q_j m_{aj}^2$ is the additive genomic variance, $\sigma_e^2$ is the residual variance, and $p_j$ and $q_j$ are the allelic frequencies of the jth marker. After obtaining the efficacy measures for each replicate in each scenario, the results are reported as the mean and standard deviation of these values. In the real data, the efficacy measures of genomic prediction were: i) predictive ability (r), the correlation between the GEBVs and the phenotype; ii) prediction bias; and iii) additive genomic heritability.
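The three efficacy measures can be computed directly from their definitions. The sketch below uses only the Python standard language; the helper names and the small numeric example are hypothetical, and q is taken as 1 − p.

```python
def mean(values):
    return sum(values) / len(values)

def covariance(u, v):
    """Sample covariance with divisor n - 1."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

def correlation(u, v):
    """Accuracy or predictive ability, depending on what GEBVs are compared to."""
    return covariance(u, v) / (covariance(u, u) * covariance(v, v)) ** 0.5

def regression_coefficient(y, gebv):
    """Slope b of the regression of the phenotype y on the GEBVs."""
    return covariance(y, gebv) / covariance(gebv, gebv)

def genomic_heritability(p, marker_effects, sigma2_e):
    """Additive genomic variance (sum of 2 p q m^2) over total variance."""
    sigma2_a = sum(2.0 * pj * (1.0 - pj) * mj ** 2
                   for pj, mj in zip(p, marker_effects))
    return sigma2_a / (sigma2_a + sigma2_e)

# Hypothetical values: GEBVs twice as spread out as the phenotypes.
gebv = [1.0, 2.0, 3.0, 4.0]
y = [0.5, 1.0, 1.5, 2.0]

r = correlation(gebv, y)
b = regression_coefficient(y, gebv)
bias = 1.0 - b
h2 = genomic_heritability(p=[0.5, 0.5], marker_effects=[1.0, 1.0], sigma2_e=1.0)
```

In this example the correlation is 1 but b = 0.5, so the GEBVs rank individuals perfectly while overstating the differences between them; the heritability is 0.5 because the additive genomic variance equals the residual variance.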

The efficacy measures are interpreted as follows: i) high accuracy values indicate that the GEBV is close to the real genomic value; ii) high predictive ability values indicate that the GEBV is close to the phenotype; iii) regression coefficients below 1 (b < 1) indicate that the GEBVs were overestimated, coefficients above 1 (b > 1) indicate that the GEBVs were underestimated, and coefficients equal to 1 (b = 1) indicate that the GEBVs are unbiased; iv) in simulated data, the estimated genomic heritability should be close to the simulated heritability. In real data, the estimated genomic heritability was compared to the heritability reported in other studies. The statistical analyses were run on a computer with an Intel(R) Core(TM) i7-6500 processor (2.50 GHz CPU) and 16 GB of RAM. All the computational routines of the methods used were implemented in GenomicLand (Azevedo et al., 2019), available at https://licaeufv.wordpress.com/genomicland/.

Results and Discussion

The means and standard deviations across simulations of the number of components, additive molecular heritability, accuracy, and bias, considering the ICR and each criterion for choosing the optimal number of ICs, are presented in Tables 1 and 2. The number of components required to reach the maximum accuracy via ICR by the exhaustive method is also presented.

Table 1
The parametric additive heritability ($h_{Ma\,par}^2$), the number of components (Nc), additive heritability ($h_{aM}^2$), accuracy (r), and regression coefficient ($\hat{b}_{y\hat{a}}$) considering each criterion of choice for the number of independent components and the scenarios of polygenic inheritance.

Table 2
The number of components (Nc), additive heritability ($h_{aM}^2$), predictive capacity (r), and prediction bias ($\hat{b}_{y\hat{a}}$) considering the exhaustive method and each criterion for choice of number of independent components.

Among the seven criteria analyzed, Criteria 1, 3, 6, and 7 presented the accuracy values closest to the maximum accuracy obtained by the exhaustive method across the four scenarios. Although Criteria 6 and 7 presented high accuracy values, both overestimated the genomic values, as shown by the regression coefficients, which reveal that the estimates vary more than the simulated values. Criteria 1 and 3 presented regression coefficients closer to unity than those obtained by the exhaustive method, which further favors Criterion 1, as it combines high accuracy with low bias. Bias is relevant because selection involves individuals of many generations using marker effects estimated in a single generation; low bias is desirable not only to select individuals but also to determine their genomic merits (Resende et al., 2014).

No criterion was adequate to estimate heritability in scenarios 2, 3, and 4, since the estimates did not recover the simulated heritability. However, these values are close to the heritability obtained by the exhaustive criterion, which considers the maximum accuracy via ICR. In Criteria 4, 5, 6, and 7, the heritability estimates equal 1, and these criteria are associated with the largest numbers of components in the model. The number of components thus influences the heritability estimate: as more components are included in the model, heritability tends to 1. This can be explained by the fact that the ICR method treats the SNPs as fixed effects; according to Resende et al. (2014), when the markers are assumed to be fixed effects, heritability is implicitly assumed to equal 1.

Regarding the Forward Selection and Backward Elimination criteria (Criteria 6 and 7, respectively), variable selection methods aim to remove variables that are not relevant or not closely related to the dependent variable (James et al., 2013). In the case of ICR, the components are independent (uncorrelated and without any functional relation to each other), and thus more variables, that is, more components, were needed to explain the response variable in Criteria 6 and 7. The prediction of genomic values using these selection criteria was not adequate, since they were associated with the largest biases (regression coefficients close to 0).

Other criteria have been proposed, such as the Akaike Information Criterion (AIC), BIC, the coefficient of determination, the residual mean square, and the adjusted coefficient of determination. However, applying these criteria was not feasible, since the computational time would have been the same as in the exhaustive method. Similarly, the Stepwise Selection method selected the complete model (999 components), which was associated with low accuracy values.

The number of components, additive molecular heritability, predictive capacity, and prediction bias for the six rice traits are shown in Table 2, considering each criterion for choosing the optimal number of ICs (Criterion 1: based on predictive ability or accuracy obtained through the PCR fit; Criterion 2: based on bias and predictive ability or accuracy obtained through the PCR fit; Criterion 3: based on the percentage of the total variation of the markers explained by the PCs; Criterion 4: based on the coefficient of determination obtained after the PCR fit; Criterion 5: based on the percentage of the total variation of the markers explained by the ICs; Criterion 6: based on the IC Forward Selection algorithm; and Criterion 7: based on the IC Backward Elimination algorithm). The number of ICs required by the exhaustive method is also shown in Table 2. The results for the six traits corroborate the findings obtained in the simulated data.

For the real data, Criterion 1 presented values of predictive capacity closer to the maximum for the traits panicle number per plant, plant height, and panicle length. In this context, Criteria 2 and 3 were also significant for the traits panicle number per plant and panicle length, while Criterion 4 did not show prominence for any trait. The analyses of the regression coefficient showed that all the criteria were biased and Criteria 6 and 7 considerably overestimated the genomic values for all traits, as observed in the analyses of the simulated data.

In relation to the traits plant height, panicle length, and seed number per panicle, Bisne et al. (2009) reported that heritability values range from medium to high, indicating success in selection. Considering the real dataset, the heritability values found in our study and in other studies are presented in Table 3. The heritability values presented by Akinwale et al. (2011) and Seyoum et al. (2012) were estimated via pedigree, whereas our study considered genomic heritability. In addition, Ogunbayo et al. (2014) reported a high heritability value for number of panicles in the primary panicle, which is justifiable, since these authors considered broad-sense heritability.

Table 3
Estimated heritability values and heritability values reported in the literature for each trait.

The computational times for the simulated and real data, in seconds (s) and hours (h), are presented in Table 4. For the simulated dataset, the exhaustive method required a high computational time even for a single replicate of each scenario. The same was observed in the real dataset, which has a large number of molecular markers. With Criterion 1, however, the computational time was drastically reduced. The time required by the exhaustive method would be substantially greater considering that 500,000 and 600,000 SNPs are identified in bovine and ovine genotyping (Brito et al., 2017; Wilkinson et al., 2017), that is, hundreds of thousands of marker effects to be estimated considering only the additive model.

Table 4
Computational time in s (h) considering the simulated data and real data and each criterion for choosing the number of independent components.

Conclusion

In general, Criterion 1, in which the number of ICs equals the number of PCs that leads to the highest accuracy, proved to be an effective and computationally feasible alternative to the exhaustive method, both for the simulated data and for the traits of the real data. Criterion 3 had high accuracy values for the simulated data and for some traits of the real data, but lower values than Criterion 1. Criteria 6 and 7 had high accuracy values for the real and simulated data, but they overestimated the genomic breeding values. Criteria 2 and 4 had low accuracy values. None of the criteria was able to recover the simulated heritability values.
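As an illustration only, the selection step behind Criterion 1 can be sketched as follows. This is a minimal sketch and not the code used in the study: it assumes that accuracy is measured as the cross-validated correlation between predicted and observed phenotypes, that PCR is fitted by ordinary least squares on the first k principal component scores, and that the data are a toy simulation. The function names `pcr_cv_accuracy` and `choose_number_of_ics` are hypothetical.

```python
import numpy as np


def pcr_cv_accuracy(X, y, k_max, n_folds=5, seed=0):
    """Mean cross-validated accuracy of PCR for k = 1, ..., k_max PCs.

    Accuracy is taken here (an assumption) as the correlation between
    predicted and observed phenotypes in the validation fold.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = np.array_split(rng.permutation(n), n_folds)
    acc = np.zeros((n_folds, k_max))
    for f, test in enumerate(folds):
        train = np.setdiff1d(np.arange(n), test)
        x_mean, y_mean = X[train].mean(axis=0), y[train].mean()
        Xc = X[train] - x_mean
        # Principal axes from the SVD of the centred training markers
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        for k in range(1, k_max + 1):
            V = Vt[:k].T
            # Least-squares regression of the phenotype on the first k PC scores
            coef, *_ = np.linalg.lstsq(Xc @ V, y[train] - y_mean, rcond=None)
            pred = (X[test] - x_mean) @ V @ coef + y_mean
            acc[f, k - 1] = np.corrcoef(pred, y[test])[0, 1]
    return acc.mean(axis=0)


def choose_number_of_ics(X, y, k_max, n_folds=5, seed=0):
    """Criterion 1 as sketched here: number of ICs = number of PCs
    with the highest cross-validated PCR accuracy."""
    acc = pcr_cv_accuracy(X, y, k_max, n_folds, seed)
    return int(np.argmax(acc)) + 1, acc


# Toy data (hypothetical): 120 individuals, 30 markers, 5 with an effect
rng = np.random.default_rng(1)
X = rng.standard_normal((120, 30))
effects = np.zeros(30)
effects[:5] = 1.0
y = X @ effects + 0.5 * rng.standard_normal(120)

n_ics, acc = choose_number_of_ics(X, y, k_max=30)
print(n_ics, round(float(acc[n_ics - 1]), 3))
```

Under these assumptions, the ICR model would then be fitted once with `n_ics` components, instead of being refitted for every candidate number of ICs as in the exhaustive method, which is where the saving in computational time would come from.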

Acknowledgments

To the Coordination for the Improvement of Higher Education Personnel (CAPES) and the Brazilian National Council for Scientific and Technological Development (CNPq), for financial support (Finance code 001).

References

  • Akinwale, M.G.; Gregorio, G.; Nwilene, F.; Akinyele, B.O.; Ogunbayo, S.A.; Odiyi, A.C. 2011. Heritability and correlation coefficient analysis for yield and its components in rice (Oryza sativa L.). African Journal of Plant Science 5: 207-212.
  • Ammiraju, J.S.S.; Luo, M.; Goicoechea, J.L.; Wang, W.; Kudrna, D.; Mueller, C.; Talag, J.; Kim, H.; Sisneros, N.B.; Blackmon, B.; Fang, E.; Tomkins, J.B.; Brar, D.; Mackill, D.; Maccouch, S.; Kurata, N.; Lambert, G.; Galbraith, D.W.; Arumuganathan, K.; Rao, K.; Walling, J.G.; Gill, N.Y.U.Y.; Sanmiguel, P.; Soderlund, C.; Jackson, S.; Wing, R.A. 2006. The Oryza bacterial artificial chromosome library resource: construction and analysis of 12 deep-coverage large-insert BAC libraries that represent the 10 genome types of the genus Oryza. Genome Research 16: 140-147.
  • Azevedo, C.F.; Nascimento, M.; Fontes, V.C.; Silva, F.F.; Resende, M.D.V.; Cruz, C.D. 2019. GenomicLand: software for genome-wide association studies and genomic prediction. Acta Scientiarum. Agronomy 41: e45361.
  • Azevedo, C.F.; Resende, M.D.V.; Nascimento, M.; Viana, J.M.S.; Valente, M.S.F. 2017. Population structure correction for genomic selection through eigenvector covariates. Crop Breeding and Applied Biotechnology 17: 350-358.
  • Azevedo, C.F.; Resende, M.D.V.; Silva, F.F.; Lopes, O.S.; Guimarães, S.E.F. 2013. Independent component regression applied to genomic selection for carcass traits in pigs. Pesquisa Agropecuária Brasileira 48: 619-626.
  • Azevedo, C.F.; Resende, M.D.V.; Silva, F.F.; Viana, J.M.S.; Valente, M.S.F.; Resende Junior, M.F.R.; Muñoz, P. 2015. Ridge, LASSO and bayesian additive-dominance genomic models. BMC Genetics 16: 1-13.
  • Azevedo, C.F.; Silva, F.F.; Resende, M.D.V.; Lopes, M.S.; Duijvesteijn, N.; Guimarães, S.E.F.; Lopes, P.S.; Kelly, M.J.; Viana, J.M.S.; Knol, E.F. 2014. Supervised independent component analysis as an alternative method for genomic selection in pigs. Journal of Animal Breeding and Genetics 131: 452-461.
  • Bingham, E.; Hyvärinen, A. 2000. A fast fixed-point algorithm for independent component analysis of complex valued signals. International Journal of Neural Systems 10: 1-8.
  • Bisne, R.; Sarawgi, A.K.; Verulkar, S.B. 2009. Study of heritability, genetic advance and variability for yield contributing characters in rice. Bangladesh Journal of Agricultural Research 34: 175-179.
  • Brito, L.F.; McEwan, J.C.; Miller, S.P.; Pickering, N.K.; Bain, W.E.; Dodds, K.G.; Schenkel, F.S.; Clarke, S.M. 2017. Genetic diversity of a New Zealand multi-breed sheep population and composite breeds’ history revealed by a high-density SNP chip. BMC Genetics 18: 1-11.
  • Cadavid, A.C.; Lawrence, J.K.; Ruzmaikin, A. 2008. Principal components and independent component analysis of solar and space data. Solar Physics 248: 247-261.
  • Desta, Z.A.; Ortiz, R. 2014. Genomic selection: genome-wide prediction in plant improvement. Trends in Plant Science 19: 592-601.
  • Ferreira, D.F. 2012. Multivariate Statistics = Estatística Multivariada. Editora UFLA, Lavras, MG, Brazil (in Portuguese).
  • Grenier, C.; Cao, T.V.; Ospina, Y.; Quintero, C.; Châtel, M.H.; Tohme, J.; Courtois, B.; Ahmadi, N. 2015. Accuracy of genomic selection in a rice synthetic population developed for recurrent selection breeding. PloS One 10: e0136594.
  • Hassen, M.B.; Cao, T.V.; Bartholomé, J.; Orasen, G.; Colombi, C.; Rakotomalala, J.; Bertone, C.; Biselli, C.; Volante, A.; Desiderio, F.; Jacquin, L.; Valè, G.; Ahmadi, N. 2018. Rice diversity panel provides accurate genomic predictions for complex traits in the progenies of biparental crosses involving members of the panel. Theoretical and Applied Genetics 131: 417-435.
  • Helwig, N.E.; Hong, S.A. 2013. Critique of tensor probabilistic independent component analysis: implications and recommendations for multi-subject fMRI data analysis. Journal of Neuroscience Methods 213: 263-273.
  • Hyvärinen, A. 1998. New approximations of differential entropy for independent component analysis and projection pursuit. Advances in Neural Information Processing Systems 10: 273-279.
  • James, G.; Witten, D.; Hastie, T.; Tibshirani, R. 2013. An Introduction to Statistical Learning. Springer, New York, NY, USA.
  • Le Floch, É.; Guillemot, V.; Frouin, V.; Pinel, P.; Lalanne, C.; Trinchera, L.; Tenenhaus, A.; Moreno, A.; Zilbovicius, M.; Bourgeron, T.; Dehaene, S.; Thirion, B.; Poline, J.B.; Duchesnay, É. 2012. Significant correlation between a set of genetic polymorphisms and a functional brain network revealed by feature selection and sparse Partial Least Squares. Neuroimage 63: 11-24.
  • Meuwissen, T.H.E.; Hayes, B.J.; Goddard, M.E. 2001. Prediction of total genetic value using genome wide dense marker maps. Genetics 157: 1819-1829.
  • Ogunbayo, S.A.; Ojo, D.K.; Sanni, K.A.; Akinwale, M.G.; Toulou, B.; Shittu, A.; Idehen, E.O.; Popoola, A.R.; Daniel, I.O.; Gregorio, G.B. 2014. Genetic variation and heritability of yield and related traits in promising rice genotypes (Oryza sativa L.). Journal of Plant Breeding and Crop Science 6: 153-159.
  • Resende, M.D.V.; Silva, F.F.; Azevedo, C.F. 2014. Mathematical, Biometric and Computational Statistics: Mixed, Multivariate, Categorical and Generalized Models (REML / BLUP), Bayesian Inference, Random Regression, Genomic Selection, QTL-GWAS, Spatial and Temporal Statistics, Competition, Survival = Estatística Matemática, Biométrica e Computacional: Modelos Mistos, Multivariados, Categóricos e Generalizados (REML/BLUP), Inferência Bayesiana, Regressão Aleatória, Seleção Genômica, QTL-GWAS, Estatística Espacial e Temporal, Competição, Sobrevivência. Editora Suprema, Visconde do Rio Branco, MG, Brazil (in Portuguese).
  • Resende, M.D.V.; Silva, F.F.; Lopes, P.S.; Azevedo, C.F. 2012. Genomic Wide Selection (GWS) by Mixed Models (REML/BLUP), Bayesian Inference (MCMC), Multivariate Random Regression (RRM) and Spatial Statistics = Seleção Genômica Ampla (GWS) via Modelos Mistos (REML/BLUP), Inferência Bayesiana (MCMC), Regressão Aleatória Multivariada e Estatística Espacial. Editora UFV, Viçosa, MG, Brazil (in Portuguese).
  • Seyoum, M.; Alamerew, S.; Bantte, K. 2012. Genetic variability, heritability, correlation coefficient and path analysis for yield and yield related traits in upland rice (Oryza sativa L.). Journal of Plant Sciences 7: 13-22.
  • Spindel, J.E.; Begum, H.; Akdemir, D.; Collard, B.; Redoña, E.; Jannink, J.L.; McCouch, S. 2016. Genome-wide prediction models that incorporate de novo GWAS are a powerful new tool for tropical rice improvement. Heredity 116: 395-408.
  • Spindel, J.E.; Begum, H.; Akdemir, D.; Virk, P.; Collard, B.; Redoña, E.; Atlin, G.; Jannink, J.L.; McCouch, S.R. 2015. Genomic selection and association mapping in rice (Oryza sativa): effect of trait genetic architecture, training population composition, marker number and statistical model on accuracy of rice genomic selection in elite, tropical rice breeding lines. PLOS Genetics 11: e1004982.
  • Wilkinson, S.; Bishop, S.C.; Allen, A.R.; Mcbride, S.H.; Skuce, R.A.; Bermingham, M.; Woolliams, J.A.; Glass, E.J. 2017. Fine-mapping host genetic variation underlying outcomes to Mycobacterium bovis infection in dairy cows. BMC Genomics 18: 1-13.
  • Zhao, K.; Tung, C.W.; Eizenga, G.C.; Wright, M.H.; Ali, M.L.; Price, A.H.; Norton, J.G.; Islam, A.R.; Reynolds, A.; Mezey, J.; McClung, A.M.; Bustamante, C.D.; McCouch, S.R. 2011. Genome-wide association mapping reveals a rich genetic architecture of complex traits in Oryza sativa. Nature Communications 2: 1-10.

Edited by: Thomas Kumke

Publication Dates

  • Publication in this collection
    01 Nov 2021
  • Date of issue
    2022

History

  • Received
    21 Dec 2020
  • Accepted
    27 Aug 2021
Escola Superior de Agricultura "Luiz de Queiroz" USP/ESALQ - Scientia Agricola, Av. Pádua Dias, 11, 13418-900 Piracicaba SP Brazil, Phone: +55 19 3429-4401 / 3429-4486 - Piracicaba - SP - Brazil
E-mail: scientia@usp.br