Acessibilidade / Reportar erro

Missing value imputation in multi-environment trials: Reconsidering the Krzanowski method

Abstract

We propose a new methodology for multiple imputation when faced with missing data in multi-environmental trials with genotype-by-environment interaction, based on the imputation system developed by Krzanowski that uses the singular value decomposition (SVD) of a matrix. Several different iterative variants are described; differential weights can also be included in each variant to represent the influence of different components of SVD in the imputation process. The methods are compared through a simulation study based on three real data matrices that have values deleted randomly at different percentages, using as measure of overall accuracy a combination of the variance between imputations and their mean square deviations relative to the deleted values. The best results are shown by two of the iterative schemes that use weights belonging to the interval [0.75, 1]. These schemes provide imputations that have higher quality when compared with other multiple imputation methods based on the Krzanowski method.

Key words:
Singular value decomposition; weights; missing data; genotype-by-environment interaction; plant breeding

INTRODUCTION

In plant breeding, multi-environment trials are important in order to test the general and specific adaptation of cultivars. A cultivar which grows in different environments will show significant fluctuations of performance in production relative to other cultivars. These changes are influenced by different environmental conditions; and in crop science literature they are commonly referred to as genotype-by-environment interactions, or G x E. Some useful references can be found in Arciniegas-Alarcón et al. (2013Arciniegas-Alarcón S , García-Peña M, Krzanowski WJ and Dias CTS (2013) Deterministic imputation in multienvironment trials. ISRN Agronomy 2013: 1-17. ).

Although GxE trials are planned to be balanced, missing values occur for various reasons, such as removal of underperforming genotypes, inclusion of new genotypes, human errors or natural causes. For instance, plants might be destroyed by animals, by floods, or during the harvest, while yield measurements may be erroneously carried out or incorrectly entered into the data base (Rodrigues et al. 2011Rodrigues PC, Pereira DGS and Mexia JT (2011) A comparison between joint regression analysis and the additive main and multiplicative interaction model: the robustness with increasing amounts of missing data. Scientia Agricola 68: 697-705.).

Hence, the result is usually an unbalanced trial which cannot be directly analyzed by recommended and efficient methods based on the AMMI model or on the biplot (Yan et al. 2007Yan W, Kang MS, Ma BL, Woods S and Cornelius PL (2007) GGE biplot vs. AMMI analysis of genotype-by-environment data. Crop Science 47: 641-653., Yang et al. 2009Yang RC, Crossa J, Cornelius PL and Burgueño J (2009) Biplot analysis of genotype × environment interaction: Proceed with caution. Crop Science 49: 1564-1576., Gauch 2013Gauch H (2013) A simple protocol for AMMI analysis of yield trials. Crop Science 53: 1860-1869. ). The principal difficulty is that these methods involve the calculation of the singular value decomposition (SVD) of a matrix, which does not exist for matrices with missing values (Gabriel 2002Gabriel KR (2002) Le biplot - outil d´exploration de données multidimensionelles. Journal de la Société Française de Statistique 143: 5-55., Arciniegas-Alarcón et al. 2014aArciniegas-Alarcón S, Dias CTS and García-Peña M (2014a) Distribution-free multiple imputation in incomplete two-way tables. Pesquisa Agropecuária Brasileira 49: 683-691.).

Possible ways of analysing GxE trials containing missing values are: i) extracting a balanced subset of data, by deleting those genotypes or environments which contain any missing values (Ceccarelli et al. 2007Ceccarelli S, Grando S and Baum M (2007) Participatory plant breeding in water-limited environments. Experimental Agriculture 43: 411-435., Yan et al. 2011Yan W, Pageau D, Frégeau-Reid J and Durand J (2011) Assessing the representativeness and repeatability of test locations for genotype evaluation. Crop Science 51: 1603-1610.); ii) filling the missing cells with environmental means; or iii) filling the missing cells with values estimated from fitted multiplicative or mixed linear models (Arciniegas-Alarcón et al. 2011Arciniegas-Alarcón S, García-Peña M and Dias CTS (2011) Data imputation in trials with genotype×environment interaction. Interciencia 36: 444-449. , Kumar et al. 2012Kumar A, Verulkar SB, Mandal NP, Variar M, Shukla VD, Dwivedi JL, Singh BN, Singh ON, Swain P, Mall AK, Robin S, Chandrababu R, Jain A, Haefele SM, Piepho HP and Raman A (2012) High-yielding, drought-tolerant, stable rice genotypes for the shallow rainfed lowland droughtprone ecosystem. Field Crops Research 133: 37-47.). Some researchers prefer to use either the mixed linear model based on the statistical method of Restricted Maximum Likelihood/Best Linear Unbiased Prediction (REML/BLUP) or Bayesian approaches. For a description of these methodologies see, e.g., Fritsche-Neto et al. (2010Fritsche-Neto R, Gonçalves MC, Vencovsky R and Souza Junior CL (2010) Prediction of genotypic values of maize hybrids in unbalanced experiments. Crop Breeding and Applied Biotechnology 10: 32-39.), Crossa et al. (2011Crossa J, Perez-Elizalde S, Jarquin D, Cotes JM, Viele K, Liu G and Cornelius PL (2011) Bayesian estimation of the additive main effects and multiplicative interaction model. Crop Science 51: 1458-1469.), Josse et al. (2014Josse J, van Eeuwijk F, Piepho HP and Denis JB (2014) Another look at Bayesian analysis of AMMI models for genotype-environment data. Journal of Agricultural, Biological, and Environmental Statistics 19: 240-257.) and Omer et al. (2015Omer SO, Abdalla AWH, Mohammed MH and Singh M (2015) Bayesian estimation of genotype-by-environment interaction in sorghum variety trials. Communications in Biometry and Crop Science 10: 82-95.).

These strategies may overcome the lack of balance in the data, but none of them is simple and effective (Yan 2013Yan W (2013) Biplot analysis of incomplete two-way data. Crop Science 53: 48-57.). The first strategy does not make use of all the available information; the second one may have problems when too many values are missing; and the third one involves multiple steps and complicated procedures. See Yan (2013) for details.

Krzanowski (1988Krzanowski WJ (1988) Missing value imputation in multivariate data using the singular value decomposition of a matrix. Biometrical Letters 25: 31-39.) proposed a perfectly general imputation scheme free of any distributional and structural constraints, which uses SVD to impute missing values. The scheme can be used on any data set that can be arranged in matrix form and, therefore, can be applied in GxE trials. The procedure provides simple imputation; however, Josse and Husson (2012Josse J and Husson F (2012) Handling missing values in exploratory multivariate data analysis methods. Journal de la Société Française de Statistique 153: 79-99.) and van Buuren (2012van Buuren S (2012) Flexible imputation of missing data. CRC, Boca Raton, 343p.) have warned that such methods do not take into account the uncertainty produced by the imputation. Moreover, if parameters are subsequently estimated from the imputed values, then the standard errors will be underestimated, which means that confidence intervals and tests will not be valid, even if the imputation model is correct.

In order to solve this problem, multiple imputation (MI) (Rubin 1987Rubin DB (1987) Multiple imputation for nonresponse in surveys. John Wiley & Sons, New York , 258p., Harel and Zhou 2007Harel O and Zhou XH (2007) Multiple imputation: review of theory, implementation, and software. Statistics in Medicine 26: 3057-3077., Allison 2012Allison PD (2012) Handling missing data by maximum likelihood. SAS Global Forum, Statistics and Data Analysis, 21p. (Paper 312). Available at < Available at http://www.statisticalhorizons.com/wp-content/uploads/MissingDataByML.pdf > Accessed in May, 2016.
http://www.statisticalhorizons.com/wp-co...
, Rässler et al. 2013Rässler S, Rubin D B and Zell ER (2013) Imputation. WIREs Computational Statistics 5: 20-29. ) can be used. This involves three distinct steps: i) Imputation: The missing values are estimated M times, generating M completed data sets (observed+imputed); ii) Analysis: The M completed data sets are analyzed using appropriate statistical procedures for the problem at hand; iii) Combination: The M separate sets of results are combined into one single inference.

The literature and the associated statistical software provide several alternatives of MI, such as the parametric regression, the propensity score method, or the Markov chain Monte Carlo (MCMC) method (Zhang 2003Zhang P (2003) Multiple imputation: Theory and method. International Statistical Review 71: 581-592, Yuan 2011Yuan Y (2011) Multiple imputation using SAS software. Journal of Statistical Software 45: 1-25.). However, these methodologies require that certain assumptions are met. The assumption in all three methods is that the missing data depend on observed variables, which means that there is a missing at random mechanism (MAR), as defined by Little and Rubin (2002Little R and Rubin D (2002) Statistical analysis with missing data. 2nd edn, John Wiley & Sons, New York, 408p.). Also, parametric regression and MCMC depend on the assumption of multivariate normality.

In this paper, a method is proposed for the first step of MI that does not make any distributional or structural assumptions in GxE experiments, by modifying the algorithm presented by Krzanowski (1988Krzanowski WJ (1988) Missing value imputation in multivariate data using the singular value decomposition of a matrix. Biometrical Letters 25: 31-39.), and taking related literature into account.

MATERIAL AND METHODS

In order to describe the Krzanowski method, consider a matrix Y (n x p) with elements yij and (if , the matrix should be first transposed). First, suppose there is just one missing value yij inY. Then, delete the i-th row from Y and calculate the SVD for the ((n-1) x p) resulting matrix Y(-i) as . For the next step, delete the j-th column from Y in order to obtain the SVD for the (n x(p-1)) matrix Y(-j) as. The matrices are orthonormal, while are diagonal. Afterwards, combine the two SVDs, Y(-i) and Y(-j), and the imputed value will be given by

(1)

where . When there is more than one missing value, an iterative scheme is required as follows. Initially replace all missing values by their respective column means, giving a completed matrixY, and then standardize the columns by subtracting mj and dividing the result by sj (where mj and sj represent the mean and the standard deviation of the j-th column calculated only from the observed values). Using the standardized matrix, recalculate the imputation for each missing value using the equation (1). Finally, return the matrix Y to its original scale,. Then iterate the process until stability is achieved in the imputations. In order to avoid convergence problems, a parity check should be done in each iteration by matching the sign of in (1) to the sign of obtained from the SVD of the Y matrix for each (Eastment and Krzanowski 1982Eastment HT and Krzanowski WJ (1982) Cross-validatory choice of the number of components from a principal component analysis. Technometrics 24: 73-77.).

It is possible to avoid the parity check by using an alternative expression for (1) following the results of Bro et al. (2008Bro R, Kjeldahl K, Smilde AK and Kiers HAL (2008) Cross-validation of component models: a critical look at current methods. Analytical and Bioanalytical Chemistry 390: 1241-1251. ). The authors suggest updating the missing yij by the corresponding element of the matrix

(2)

where represents the Moore-Penrose generalized inverse. It is noted that for each missing observation a different S matrix will be calculated; the inclusion of (2) in the algorithm makes the imputation be basically an expectation maximization (EM) operation (Bro et al. 2008Bro R, Kjeldahl K, Smilde AK and Kiers HAL (2008) Cross-validation of component models: a critical look at current methods. Analytical and Bioanalytical Chemistry 390: 1241-1251. ).

On the other hand, to obtain MI from the Krzanowski method, Bergamo et al. (2008Bergamo GC, Dias CTS and Krzanowski WJ (2008) Distribution free-multiple imputation in an interaction matrix through singular value decomposition. Scientia Agricola 65: 422-427.) proposed a generalization to the exponents of and , replacing equation (1) by

(3)

where a is the weight in the interval [0,1] given to . Specification of a automatically determines the weight for . For example, a weight a=0.4 for gives the weight 1-a=0.6 for . Bergamo et al. (2008Bergamo GC, Dias CTS and Krzanowski WJ (2008) Distribution free-multiple imputation in an interaction matrix through singular value decomposition. Scientia Agricola 65: 422-427.) assert that 5 imputations for each missing value is enough to ensure variability among imputations. For this reason, the authors suggest using weights a=0.4, 0.45. 0.50, 0.55 and 0.60 in turn for , each value of a providing a different imputation.

Recently, Arciniegas-Alarcón et al. (2014bArciniegas-Alarcón S , García-Peña M, Krzanowski WJ and Dias CTS (2014b) Imputing missing values in multi-environment trials using the singular value decomposition: An empirical comparison. Communications in Biometry and Crop Science 9: 54-70. ) have used a small modification for equation (3). They included two constants to eliminate the bias produced by the systematic underestimation of and in relation to (Louwerse et al. 1999Louwerse DJ, Smilde AK and Kiers HAL (1999) Cross-validation of multiway component models. Journal of Chemometrics 13: 491-510. ), using the expression

(4)

In (4), the number of components needs first to be specified. Krzanowski (1988Krzanowski WJ (1988) Missing value imputation in multivariate data using the singular value decomposition of a matrix. Biometrical Letters 25: 31-39.) and Bergamo et al. (2008Bergamo GC, Dias CTS and Krzanowski WJ (2008) Distribution free-multiple imputation in an interaction matrix through singular value decomposition. Scientia Agricola 65: 422-427.) used with the objective of using the maximum available information in the matrix. However, this choice can produce low efficiency of imputation in multi-environment trials (Arciniegas-Alarcón and Dias 2009Arciniegas-Alarcón S and Dias CTS (2009) Data imputation in trials with genotype by environment interaction: an application on cotton data. Biometric Brazilian Journal 27: 125-138.) and lead to convergence problems (Hedderley and Wakeling 1995Hedderley D and Wakeling I (1995) A comparison of imputation techniques for internal preference mapping, using Monte Carlo simulation. Food Quality and Preference 6: 281-297.). For these reasons, the authors obtained better results by choosing , where is such that and is such that.

To propose an alternative methodology of MI based on the Krzanowski method, it is suggested that a multiplicative weight is included in equation (1). Thus the following expression is used:

(5)

with , where represents the number of imputations, so that the inclusion of different weights will produce different imputations for each missing value. Again, can be used, since this number produces good statistical efficiency in many practical applications (van Buuren 2012van Buuren S (2012) Flexible imputation of missing data. CRC, Boca Raton, 343p.). Each weight can be interpreted as the influence placed on the SVD components in the final imputation, similar to that in the iterative scheme using. For instance, if the weight is low, the imputation will depend more on the column means, which are updated in each iteration. Similarly, the weight can also be included in equation (2) to obtain another MI form

(6)

The influence of the SVD in the imputation was defined in terms of percentage. Thus, the interval between 0 and 1 was used for , and seven groups of values were considered: Group1 = 0, 0.05, 0.1, 0.15, 0.2; Group2 = 0.25, 0.3, 0.35, 0.4, 0.45; Group3 = 0.5, 0.55, 0.60, 0.65, 0.7; Group4 = 0.75, 0.8, 0.85, 0.9, 0.95; Group5 = 0.96, 0.97, 0.98, 0.99, 1; Group 6: 0.2, 0.4, 0.6, 0.8, 1; and Group7 contains 5 random numbers from the uniform distribution, i.e.,. The groups were empirically defined following the work of Arciniegas-Alarcón et al. (2014aArciniegas-Alarcón S, Dias CTS and García-Peña M (2014a) Distribution-free multiple imputation in incomplete two-way tables. Pesquisa Agropecuária Brasileira 49: 683-691.), which proposed MI in incomplete two-way tables through a variation in the exponents of the diagonal matrix elements obtained in the calculation of a SVD.

With the seven groups for , and the two new proposed equations for imputation, there are in total 14 possible variations of MI under the Krzanowski system. These variations are denoted "SVDi+PC" and "SVDi+EM", with i =1,...,7. Here, "SVDi+PC" indicates the iterative scheme with parity check using equation (5) and group i for the values of , while "SVDi+EM" indicates the iterative scheme with EM imputation using the equation (6) and group i for the values of .

These 14 variations were assessed against the iterative scheme with parity check that uses equation (4) for imputation. This latter method was treated as the standard method (or "gold standard"), and was denoted "MIK-Adjusted". An extension of "MIK-Adjusted" was also included by assuming that the exponent can be chosen randomly, and; this extension was called "Unif-MIK-Adjusted". Both "MIK-Adjusted" versions and the variations described above need the value of (i.e. the number of components of SVD to be retained) to be previously determined. This value was obtained using the cv.SVDImpute function of the imputation package from R, which carries out cross-validation in incomplete matrices (Wong 2013Wong J (2013) Imputation. Version 2.0.1. Available at: <Available at: http://CRAN.Rproject.org/package=imputation >. Accessed on October 15, 2013.
http://CRAN.Rproject.org/package=imputat...
, García-Peña et al. 2014García-Peña M, Arciniegas-Alarcón S and Barbin D (2014) Climate data imputation using the singular value decomposition: an empirical comparison. Revista Brasileira de Meteorologia 29: 527-536., Arciniegas-Alarcón et al. 2014bArciniegas-Alarcón S , García-Peña M, Krzanowski WJ and Dias CTS (2014b) Imputing missing values in multi-environment trials using the singular value decomposition: An empirical comparison. Communications in Biometry and Crop Science 9: 54-70. ).

To evaluate the imputation systems, the methodology proposed by Yan (2013Yan W (2013) Biplot analysis of incomplete two-way data. Crop Science 53: 48-57.) was used, by taking complete real multi-environmental matrices, randomly deleting observations from them at different percentages, applying the algorithms, and then using appropriate summary statistics to compare the imputations with the real data. Molale et al. (2013Molale P, Twala B and Seeletse S (2013) Fingerprint prediction using statistical and machine learning methods. ICIC Express Letters 7: 1-6.) describe some of these performance measures, such as: the Root Mean Square Error (RMSE), the Mean Absolute Error (MAE), the Relative Absolute Error (RAE), and the Root Relative Squared Error (RRSE). These statistics serve to evaluate any imputation system; however, to assess new MI methods, it would be more appropriate to use statistics involving the variability between imputations. For this, Penny and Jolliffe (1999Penny KI and Jolliffe IT (1999) Multivariate outlier detection applied to multiply imputed laboratory data. Statistics in Medicine 18: 1879-1895.) introduced the Tacc statistics, which will be used in this study and which is described below.

Three data sets from GE trials were considered, which had been previously used by Lavoranti et al. (2007Lavoranti OJ, Dias CTS and Krzanowski WJ (2007) Phenotypic stability and adaptability via AMMI model with bootstrap re-sampling. Pesquisa Florestal Brasileira 54: 45-52. ), Yang (2007Yang RC (2007) Mixed-model analysis of crossover genotype-environment interactions. Crop Science 47: 1051-1062.) and Rad et al. (2013Rad MRN, Kadir MA, Rafii MY, Jaafar HZE, Naghavi MR and Ahmadi F (2013) Genotype × environment interaction by AMMI and GGE biplot analysis in three consecutive generations of wheat (Triticum aestivum) under normal and drought stress conditions. Australian Journal of Crop Science 7: 956-961.), respectively. The first data set (Rad et al. 2013) consists of a matrix for 36 wheat genotypes assessed in six environments under normal and drought stress conditions, at an experimental farm of the University of Putra, Malaysia. The studied variable was the mean plant grain yield (gr). The second data set (Lavoranti et al. 2007Lavoranti OJ, Dias CTS and Krzanowski WJ (2007) Phenotypic stability and adaptability via AMMI model with bootstrap re-sampling. Pesquisa Florestal Brasileira 54: 45-52. ) comprises a matrix, for 20 Eucalyptus grandis progenies assessed in seven locations in the south and southeast regions of Brazil, and the studied variable was mean tree height (m). Lastly, the third data set (Yang 2007Yang RC (2007) Mixed-model analysis of crossover genotype-environment interactions. Crop Science 47: 1051-1062.) is a matrix, corresponding to six barley genotypes assessed in 18 environments in Alberta, Canada. The studied variable was yield (Mg ha-1).

Table 1 shows the results from a preliminary study on the choice of the number of multiplicative components (to explain the G×E interaction) of the AMMI model that can be used for each selected data set, on which a simulation study is described below. The method of cross-validation "leave-one-out" by eigenvector (Bro et al. 2008Bro R, Kjeldahl K, Smilde AK and Kiers HAL (2008) Cross-validation of component models: a critical look at current methods. Analytical and Bioanalytical Chemistry 390: 1241-1251. , Gauch 2013Gauch H (2013) A simple protocol for AMMI analysis of yield trials. Crop Science 53: 1860-1869. ) was used to select each model, and the best model was the one which has the lowest PRESS statistics. It can be seen that AMMI3 is an appropriate model for the wheat matrix; AMMI1 is appropriate for the barley matrix; and AMMI2 is appropriate for the eucalyptus matrix. Therefore, these three data sets represent a wide range of data sizes and structures typically found in GE trials, and thus provide good basis for drawing general conclusions for such trials.

Table 1
Values of Predicted REsidual Sum of Squares (PRESS) using cross validation by eigenvector in choosing the AMMI model to explain the interaction in the original (complete) data matrices

Each original data matrix (wheat, eucalyptus and barley) was submitted to random deletion of values at the three rates 10%, 20%, and 35%, since according to Yan (2013Yan W (2013) Biplot analysis of incomplete two-way data. Crop Science 53: 48-57.), in GE trials, generally, the number of missing values is lower that 40%. The process was repeated in each data set replicated 1000 times for each percentage of missing values, giving a total of 3000 different matrices with missing values. Altogether, therefore, 9000 incomplete data sets were available, and for each one, the missing values were imputed with the 16 MI algorithms, using computational code in R (R Core Team 2014R Core Team (2014) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna. Available at <Available at http://www.R-project.org /.> Accessed in may 2014.
http://www.R-project.org...
).

The random deletion process for a matrix was carried out as follows. Random numbers between 0 and 1 were generated in the R software with the runif function. For a fixed value , if the -th random number was lower than , then the element in the position of the matrix was deleted . The expected proportion of missing values in the matrix will be (Krzanowski 1988Krzanowski WJ (1988) Missing value imputation in multivariate data using the singular value decomposition of a matrix. Biometrical Letters 25: 31-39., Arciniegas-Alarcón et al. 2013Arciniegas-Alarcón S , García-Peña M, Krzanowski WJ and Dias CTS (2013) Deterministic imputation in multienvironment trials. ISRN Agronomy 2013: 1-17. ). This technique was used with r = 0.1, 0.2 and 0.35.

The imputation accuracy was assessed with the Tacc statistics, proposed by Penny and Jolliffe (1999Penny KI and Jolliffe IT (1999) Multivariate outlier detection applied to multiply imputed laboratory data. Statistics in Medicine 18: 1879-1895.), which has been recently used by Bergamo et al. (2008Bergamo GC, Dias CTS and Krzanowski WJ (2008) Distribution free-multiple imputation in an interaction matrix through singular value decomposition. Scientia Agricola 65: 422-427.) and Arciniegas-Alarcón et al. (2014aArciniegas-Alarcón S, Dias CTS and García-Peña M (2014a) Distribution-free multiple imputation in incomplete two-way tables. Pesquisa Agropecuária Brasileira 49: 683-691.). Tacc is a measure of overall accuracy formed from the sum of the pooled variance between imputation within positions (VE) and the mean squared bias between the imputations mean and the original value deleted in the simulation study (VQM). Thus: , where and; in which "" is the total number of deleted values from the GE matrix. Each deleted value has the corresponding position (i, j) in the matrix, that is, in the i-th row and the j-th column. is the number of imputations for the missing value ; is the m-th imputation for that value according to the proposed methods; is the imputations mean produced for the missing value ; and is the original value in the complete original data set.

According to Penny and Jolliffe (1999Penny KI and Jolliffe IT (1999) Multivariate outlier detection applied to multiply imputed laboratory data. Statistics in Medicine 18: 1879-1895.), a good MI method will be that with small values for VE and VQM, which implies low values of Tacc. Thus, since "MIK-Adjusted" was treated as the standard method, any method having values for Tacc lower than those for "MIK-Adjusted" was considered as a good imputation method. It is worth pointing out that only having a reduced value of VE does not imply good imputation quality, since the method can be biased.

RESULTS

Wheat data

Table 2 shows the means and medians of Tacc, at the different percentages of values deleted randomly (10, 20 and 35%) for the wheat data set. The best results at each imputation percentage (minimizing the mean and median of Tacc) were obtained by SVD4+EM and SVD5+EM. At 35% imputation, SVD4+PC and SVD5+PC were also better than any of the MIK methods. These results indicate that the weights of should be chosen in the interval . However, the values of Tacc in all the remaining 11 imputation systems were higher than for MIK-Adjusted. Therefore, the least recommended algorithms are those with in the interval, namely SVDi+PC and SVDi+EM, with i=1,2.

Table 2
Means and medians of Tacc at different percentage levels for the wheat data

The algorithms that assumed a uniform distribution for in the case of Unif-MIK-Adjusted; for , in the case of SVD7+EM and SVD7+PC deserve special comment. The mean and the median values of Unif-MIK-Adjusted are very close to those of MIK-Adjusted, but never lower, and as the percentage of imputation increases, differences remained the same or even increased. The performances of SVD3+EM and SVD3+PC are better than those of SVD7+EM and SVD7+PC at all the percentages, which means that the insertion of a random component did not improve the imputation efficiency.

One of the initial intentions of the simulation study for wheat data was to determine a set of algorithms that were better than MIK-Ajusted, and this is clearly achieved by SVD4+EM, SVD5+EM, SVD4+PC and SVD5+PC. Thus, the results will be considered consistent if at least one of the four imputation systems minimizes Tacc in the other multi-environmental data matrices.

Eucalyptus data

Table 3 shows the means and medians of Tacc, at the different percentages of values deleted randomly (10, 20 and 35%) for the eucalyptus data set. Similar behaviour was found to that described in the wheat data. For instance, i) Unif-MIK-Adjusted never beats the gold standard MIK-Adjusted, ii) SVD3+EM and SVD3+PC always perform better than SVD7+EM and SVD7+PC, respectively, and iii) the least recommended systems are SVDi+PC and SVDi+EM, with i=1,2, (i.e., the imputations less influenced by SVD due to the chosen weight), since they have the worst performance among the proposed methodologies.

Table 3
Means and medians of Tacc at different percentage levels for the eucalyptus data

SVD4+PC, SVD5+PC and SVD4+EM beat MIK-Adjusted in all the imputation percentages. It is worth mentioning that the means and medians of Tacc for SVD4+PC are lower than those of SVD4+EM, i.e., a method that involves a parity check beats an imputation of type EM - which is a different result from those found in wheat data. While in the wheat matrix SVD5+EM was better than MIK-Adjusted, here the opposite occurs.

Barley data

Table 4 shows the means and medians of Tacc, at the different percentages of values deleted randomly (10, 20 and 35%) for the barley data set. In this case, only SVD5+EM presents better results than MIK-Adjusted, while SVD4+EM, SVD4+PC and SVD5+PC exhibit higher values of Tacc than for the standard algorithm. The other 11 variations of the Krzanowski method presented the same behaviour of the wheat and eucalyptus data.

Table 4
Means and medians of Tacc at different percentage levels for the barley data

DISCUSSION

The main conclusion is that the MI methods which have been derived from the simple imputation system of Krzanowski may have advantages over the current standard method for a wide range of GE interaction structures encountered in practice. The four imputation systems SVDi+PC and SVDi+EM with i=4,5 showed better results than the standard method in simulations involving the largest matrix (216 observations), having the most complex interaction structure. For smaller matrices and less complex interactions, a subset of these four methods beat the standard. For the matrix with 140 observations and intermediate interaction structure, this subset comprised SVDi+PC and SVDj+EM for i=4,5 and j=4, while for the smallest matrix of 108 observations and simplest interaction structure, it contained just SVD5+EM.

In the light of these results, it is clear that it is only worthwhile considering weights w in the range [0.75, 1.0] (since low values of w mean that the benefits of SVD are being ignored). However, some selection must be initially carried out between the four sets of weights and the methods identified above as the best ones. A fast path in a practical case would be to test the set of four algorithms in order to determine the best one. However, a deeper analysis of the previous results should take into consideration the fact that each method has two parts: one related to the weight , and the other related to the iterative scheme used (i.e. iterations with parity check or of type EM). According to Tacc, the best result is obtained when belongs to group 5 in the three considered matrices, as this always outperforms MIK-Adjusted. On the other hand, a definitive conclusion about the iterative scheme could not be obtained. Thus, in a real situation of incomplete GE matrices, it is recommended to test both systems SVD5+PC and SVD5+EM.

It is also worth commenting that in the context of cross-validation for components, Bro et al. (2008Bro R, Kjeldahl K, Smilde AK and Kiers HAL (2008) Cross-validation of component models: a critical look at current methods. Analytical and Bioanalytical Chemistry 390: 1241-1251. ) pointed out that the Eastment and Krzanowski (EK) method (1982Eastment HT and Krzanowski WJ (1982) Cross-validatory choice of the number of components from a principal component analysis. Technometrics 24: 73-77.) has a problem of rotational freedom, and the parity check is a source of overfitting. However, in this study, it was found that, in some cases, the parity check within an iterative scheme can offer better predictions than the EM algorithm described by Bro et al. (2008Bro R, Kjeldahl K, Smilde AK and Kiers HAL (2008) Cross-validation of component models: a critical look at current methods. Analytical and Bioanalytical Chemistry 390: 1241-1251. ). Thus a hypothesis to investigate in the future is that the influence of rotation problems and overfitting on the predictions might be minimized if EK is used iteratively.

Finally, the researcher may be interested in how to test SVD5+PC and SVD5+EM on a set of incomplete data. It is suggested that, starting with the observed data, some of the entries are randomly deleted (for instance, between 10% and 30%), the two algorithms are applied, and the Tacc statistic values are calculated. The deletion procedure is then repeated (for example, 100 times), and the mean and median of all the values are found. The method with the lowest mean or median will be the method to be used on the matrix with missing values. Another option is to choose the method according to the magnitude of the imputation uncertainty, using for the estimation the non-parametric methodology proposed by Heydarbeygie and Ahmadi (2013Heydarbeygie A and Ahmadi N (2013) Nonparametric methods for the estimation of imputation uncertainty. Journal of Applied Statistics 40: 693-698.).

ACKNOWLEDGEMENTS

The first author thanks the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior, CAPES, Brazil, (PEC-PG program) for the financial support. The second author thanks the National Council of Technological and Scientific Development, CNPq, Brazil, and the Academy of Sciences for the Developing World, TWAS, Italy, (CNPq-TWAS program) for the financial support.

REFERENCES

  • Allison PD (2012) Handling missing data by maximum likelihood. SAS Global Forum, Statistics and Data Analysis, 21p. (Paper 312). Available at < Available at http://www.statisticalhorizons.com/wp-content/uploads/MissingDataByML.pdf > Accessed in May, 2016.
    » http://www.statisticalhorizons.com/wp-content/uploads/MissingDataByML.pdf
  • Arciniegas-Alarcón S and Dias CTS (2009) Data imputation in trials with genotype by environment interaction: an application on cotton data. Biometric Brazilian Journal 27: 125-138.
  • Arciniegas-Alarcón S, Dias CTS and García-Peña M (2014a) Distribution-free multiple imputation in incomplete two-way tables. Pesquisa Agropecuária Brasileira 49: 683-691.
  • Arciniegas-Alarcón S, García-Peña M and Dias CTS (2011) Data imputation in trials with genotype×environment interaction. Interciencia 36: 444-449.
  • Arciniegas-Alarcón S , García-Peña M, Krzanowski WJ and Dias CTS (2013) Deterministic imputation in multienvironment trials. ISRN Agronomy 2013: 1-17.
  • Arciniegas-Alarcón S , García-Peña M, Krzanowski WJ and Dias CTS (2014b) Imputing missing values in multi-environment trials using the singular value decomposition: An empirical comparison. Communications in Biometry and Crop Science 9: 54-70.
  • Bergamo GC, Dias CTS and Krzanowski WJ (2008) Distribution free-multiple imputation in an interaction matrix through singular value decomposition. Scientia Agricola 65: 422-427.
  • Bro R, Kjeldahl K, Smilde AK and Kiers HAL (2008) Cross-validation of component models: a critical look at current methods. Analytical and Bioanalytical Chemistry 390: 1241-1251.
  • Ceccarelli S, Grando S and Baum M (2007) Participatory plant breeding in water-limited environments. Experimental Agriculture 43: 411-435.
  • Crossa J, Perez-Elizalde S, Jarquin D, Cotes JM, Viele K, Liu G and Cornelius PL (2011) Bayesian estimation of the additive main effects and multiplicative interaction model. Crop Science 51: 1458-1469.
  • Eastment HT and Krzanowski WJ (1982) Cross-validatory choice of the number of components from a principal component analysis. Technometrics 24: 73-77.
  • Fritsche-Neto R, Gonçalves MC, Vencovsky R and Souza Junior CL (2010) Prediction of genotypic values of maize hybrids in unbalanced experiments. Crop Breeding and Applied Biotechnology 10: 32-39.
  • Gabriel KR (2002) Le biplot - outil d´exploration de données multidimensionelles. Journal de la Société Française de Statistique 143: 5-55.
  • García-Peña M, Arciniegas-Alarcón S and Barbin D (2014) Climate data imputation using the singular value decomposition: an empirical comparison. Revista Brasileira de Meteorologia 29: 527-536.
  • Gauch H (2013) A simple protocol for AMMI analysis of yield trials. Crop Science 53: 1860-1869.
  • Harel O and Zhou XH (2007) Multiple imputation: review of theory, implementation, and software. Statistics in Medicine 26: 3057-3077.
  • Hedderley D and Wakeling I (1995) A comparison of imputation techniques for internal preference mapping, using Monte Carlo simulation. Food Quality and Preference 6: 281-297.
  • Heydarbeygie A and Ahmadi N (2013) Nonparametric methods for the estimation of imputation uncertainty. Journal of Applied Statistics 40: 693-698.
  • Josse J and Husson F (2012) Handling missing values in exploratory multivariate data analysis methods. Journal de la Société Française de Statistique 153: 79-99.
  • Josse J, van Eeuwijk F, Piepho HP and Denis JB (2014) Another look at Bayesian analysis of AMMI models for genotype-environment data. Journal of Agricultural, Biological, and Environmental Statistics 19: 240-257.
  • Krzanowski WJ (1988) Missing value imputation in multivariate data using the singular value decomposition of a matrix. Biometrical Letters 25: 31-39.
  • Kumar A, Verulkar SB, Mandal NP, Variar M, Shukla VD, Dwivedi JL, Singh BN, Singh ON, Swain P, Mall AK, Robin S, Chandrababu R, Jain A, Haefele SM, Piepho HP and Raman A (2012) High-yielding, drought-tolerant, stable rice genotypes for the shallow rainfed lowland droughtprone ecosystem. Field Crops Research 133: 37-47.
  • Lavoranti OJ, Dias CTS and Krzanowski WJ (2007) Phenotypic stability and adaptability via AMMI model with bootstrap re-sampling. Pesquisa Florestal Brasileira 54: 45-52.
  • Little R and Rubin D (2002) Statistical analysis with missing data. 2nd edn, John Wiley & Sons, New York, 408p.
  • Louwerse DJ, Smilde AK and Kiers HAL (1999) Cross-validation of multiway component models. Journal of Chemometrics 13: 491-510.
  • Molale P, Twala B and Seeletse S (2013) Fingerprint prediction using statistical and machine learning methods. ICIC Express Letters 7: 1-6.
  • Omer SO, Abdalla AWH, Mohammed MH and Singh M (2015) Bayesian estimation of genotype-by-environment interaction in sorghum variety trials. Communications in Biometry and Crop Science 10: 82-95.
  • Penny KI and Jolliffe IT (1999) Multivariate outlier detection applied to multiply imputed laboratory data. Statistics in Medicine 18: 1879-1895.
  • R Core Team (2014) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna. Available at <Available at http://www.R-project.org /.> Accessed in may 2014.
    » http://www.R-project.org
  • Rad MRN, Kadir MA, Rafii MY, Jaafar HZE, Naghavi MR and Ahmadi F (2013) Genotype × environment interaction by AMMI and GGE biplot analysis in three consecutive generations of wheat (Triticum aestivum) under normal and drought stress conditions. Australian Journal of Crop Science 7: 956-961.
  • Rässler S, Rubin D B and Zell ER (2013) Imputation. WIREs Computational Statistics 5: 20-29.
  • Rodrigues PC, Pereira DGS and Mexia JT (2011) A comparison between joint regression analysis and the additive main and multiplicative interaction model: the robustness with increasing amounts of missing data. Scientia Agricola 68: 697-705.
  • Rubin DB (1987) Multiple imputation for nonresponse in surveys. John Wiley & Sons, New York , 258p.
  • van Buuren S (2012) Flexible imputation of missing data. CRC, Boca Raton, 343p.
  • Wong J (2013) Imputation. Version 2.0.1. Available at: <Available at: http://CRAN.Rproject.org/package=imputation >. Accessed on October 15, 2013.
    » http://CRAN.Rproject.org/package=imputation
  • Yan W (2013) Biplot analysis of incomplete two-way data. Crop Science 53: 48-57.
  • Yan W, Kang MS, Ma BL, Woods S and Cornelius PL (2007) GGE biplot vs. AMMI analysis of genotype-by-environment data. Crop Science 47: 641-653.
  • Yan W, Pageau D, Frégeau-Reid J and Durand J (2011) Assessing the representativeness and repeatability of test locations for genotype evaluation. Crop Science 51: 1603-1610.
  • Yang RC (2007) Mixed-model analysis of crossover genotype-environment interactions. Crop Science 47: 1051-1062.
  • Yang RC, Crossa J, Cornelius PL and Burgueño J (2009) Biplot analysis of genotype × environment interaction: Proceed with caution. Crop Science 49: 1564-1576.
  • Yuan Y (2011) Multiple imputation using SAS software. Journal of Statistical Software 45: 1-25.
  • Zhang P (2003) Multiple imputation: Theory and method. International Statistical Review 71: 581-592

Publication Dates

  • Publication in this collection
    June 2016

History

  • Received
    06 Nov 2014
  • Accepted
    15 Dec 2015
Crop Breeding and Applied Biotechnology Universidade Federal de Viçosa, Departamento de Fitotecnia, 36570-000 Viçosa - Minas Gerais/Brasil, Tel.: (55 31)3899-2611, Fax: (55 31)3899-2611 - Viçosa - MG - Brazil
E-mail: cbab@ufv.br