Missing value imputation in multi-environment trials: Reconsidering the Krzanowski method

We propose a new methodology for multiple imputation when faced with missing data in multi-environmental trials with genotype-by-environment interaction, based on the imputation system developed by Krzanowski that uses the singular value decomposition (SVD) of a matrix. Several different iterative variants are described; differential weights can also be included in each variant to represent the influence of different components of SVD in the imputation process. The methods are compared through a simulation study based on three real data matrices that have values deleted randomly at different percentages, using as measure of overall accuracy a combination of the variance between imputations and their mean square deviations relative to the deleted values. The best results are shown by two of the iterative schemes that use weights belonging to the interval [0.75, 1]. These schemes provide imputations that have higher quality when compared with other multiple imputation methods based on the Krzanowski method.


INTRODUCTION
In plant breeding, multi-environment trials are important in order to test the general and specific adaptation of cultivars. A cultivar which grows in different environments will show significant fluctuations of performance in production relative to other cultivars. These changes are influenced by different environmental conditions; and in crop science literature they are commonly referred to as genotype-by-environment interactions, or G×E. Some useful references can be found in Arciniegas-Alarcón et al. (2013).
Although G×E trials are planned to be balanced, missing values occur for various reasons, such as removal of underperforming genotypes, inclusion of new genotypes, human errors or natural causes. For instance, plants might be destroyed by animals, by floods, or during the harvest, while yield measurements may be erroneously carried out or incorrectly entered into the data base (Rodrigues et al. 2011).
Hence, the result is usually an unbalanced trial which cannot be directly analyzed by recommended and efficient methods based on the AMMI model or on the biplot (Yan et al. 2007, Yang et al. 2009, Gauch 2013. The principal difficulty is that these methods involve the calculation of the singular value decomposition (SVD) of a matrix, which does not exist for matrices with missing S Arciniegas-Alarcón et al.
Possible ways of analysing G×E trials containing missing values are: i) extracting a balanced subset of data, by deleting those genotypes or environments which contain any missing values (Ceccarelli et al. 2007, Yan et al. 2011; ii) filling the missing cells with environmental means; or iii) filling the missing cells with values estimated from fitted multiplicative or mixed linear models (Arciniegas-Alarcón et al. 2011, Kumar et al. 2012. Some researchers prefer to use either the mixed linear model based on the statistical method of Restricted Maximum Likelihood/Best Linear Unbiased Prediction (REML/BLUP) or Bayesian approaches. For a description of these methodologies see, e.g., Fritsche-Neto et al. (2010), Crossa et al. (2011), Josse et al. (2014) and Omer et al. (2015).
These strategies may overcome the lack of balance in the data, but none of them is simple and effective (Yan 2013). The first strategy does not make use of all the available information; the second one may have problems when too many values are missing; and the third one involves multiple steps and complicated procedures. See Yan (2013) for details. Krzanowski (1988) proposed a perfectly general imputation scheme free of any distributional and structural constraints, which uses SVD to impute missing values. The scheme can be used on any data set that can be arranged in matrix form and, therefore, can be applied in G×E trials. The procedure provides simple imputation; however, Josse and Husson (2012) and van Buuren (2012) have warned that such methods do not take into account the uncertainty produced by the imputation. Moreover, if parameters are subsequently estimated from the imputed values, then the standard errors will be underestimated, which means that confidence intervals and tests will not be valid, even if the imputation model is correct.
In order to solve this problem, multiple imputation (MI) (Rubin 1987, Harel and Zhou 2007, Allison 2012, Rässler et al. 2013 can be used. This involves three distinct steps: i) Imputation: The missing values are estimated M times, generating M completed data sets (observed+imputed); ii) Analysis: The M completed data sets are analyzed using appropriate statistical procedures for the problem at hand; iii) Combination: The M separate sets of results are combined into one single inference.
The literature and the associated statistical software provide several alternatives of MI, such as the parametric regression, the propensity score method, or the Markov chain Monte Carlo (MCMC) method (Zhang 2003, Yuan 2011. However, these methodologies require that certain assumptions are met. The assumption in all three methods is that the missing data depend on observed variables, which means that there is a missing at random mechanism (MAR), as defined by Little and Rubin (2002). Also, parametric regression and MCMC depend on the assumption of multivariate normality.
In this paper, a method is proposed for the first step of MI that does not make any distributional or structural assumptions in G×E experiments, by modifying the algorithm presented by Krzanowski (1988), and taking related literature into account.

MATERIAL AND METHODS
In order to describe the Krzanowski method, consider a matrix Y (n×p) with elements y ij (i = 1,...n; j=1...p) and n > p (if n < p, the matrix should be first transposed). First, suppose there is just one missing value y ij in Y. Then, delete the i-th row from Y and calculate the SVD for the ((n -1)×p) resulting matrix Y (-i) as Y (-i) . For the next step, delete the j-th column from Y in order to obtain the SVD for the (n×(p -1)) matrix Y (-j) as Y (-j) ..,d p-1 ) . The matrices U, V, Ũ and Ṽ are orthonormal, while D and D are diagonal.
Afterwards, combine the two SVDs, Y (-i) and Y (-j) , and the imputed value will be given by where H = min{n -1, p -1}. When there is more than one missing value, an iterative scheme is required as follows. Initially replace all missing values by their respective column means, giving a completed matrix Y, and then standardize the columns by subtracting m j and dividing the result by s j (where m j and s j represent the mean and the standard deviation of the j-th column calculated only from the observed values). Using the standardized matrix, recalculate the ij ih jh imputation for each missing value using the equation (1). Finally, return the matrix Y to its original scale, y ij = m j + s j ŷ ij . Then iterate the process until stability is achieved in the imputations. In order to avoid convergence problems, a parity check should be done in each iteration by matching the sign of ( )( (1) to the sign of u ih d h v jh obtained from the SVD of the Y matrix for each h = 1,...,H (Eastment and Krzanowski 1982).
It is possible to avoid the parity check by using an alternative expression for (1) following the results of Bro et al. (2008). The authors suggest updating the missing y ij by the corresponding element of the matrix where (•) + represents the Moore-Penrose generalized inverse. It is noted that for each missing observation a different S matrix will be calculated; the inclusion of (2) in the algorithm makes the imputation be basically an expectation maximization (EM) operation (Bro et al. 2008).
On the other hand, to obtain MI from the Krzanowski method, Bergamo et al. (2008) proposed a generalization to the exponents of d h and d h , replacing equation (1) by where a is the weight in the interval [0,1] given to Y (-j) . Specification of a automatically determines the weight for Y (-i) . For example, a weight a=0.4 for Y (-j) gives the weight 1-a=0.6 for Y (-i) . Bergamo et al. (2008) assert that 5 imputations for each missing value is enough to ensure variability among imputations. For this reason, the authors suggest using weights a=0.4, 0.45. 0.50, 0.55 and 0.60 in turn for Y (-j) , each value of a providing a different imputation.
Recently, Arciniegas-Alarcón et al. (2014b) have used a small modification for equation (3). They included two constants to eliminate the bias produced by the systematic underestimation of D and D in relation to D (Louwerse et al. 1999), using the expression In (4), the number of components H needs first to be specified. Krzanowski (1988) and Bergamo et al. (2008) used H = min{n-1, p-1} with the objective of using the maximum available information in the matrix. However, this choice can produce low efficiency of imputation in multi-environment trials (Arciniegas-Alarcón and Dias 2009) and lead to convergence problems (Hedderley and Wakeling 1995). For these reasons, the authors obtained better results by To propose an alternative methodology of MI based on the Krzanowski method, it is suggested that a multiplicative weight w t ϵ [0,1] is included in equation (1). Thus the following expression is used: with t=1,...,M , where M represents the number of imputations, so that the inclusion of different weights will produce different imputations for each missing value. Again, M=5 can be used, since this number produces good statistical efficiency in many practical applications (van Buuren 2012). Each weight w t can be interpreted as the influence placed on the SVD components in the final imputation, similar to that in the iterative scheme using m j + s j ŷ ij . For instance, if the weight w t is low, the imputation will depend more on the column means, which are updated in each iteration. Similarly, the weight w t can also be included in equation (2) to obtain another MI form The influence of the SVD in the imputation was defined in terms of percentage. Thus, the interval between 0 and 1 was used for w t , and seven groups of values were considered: Group1 = 0, 0. With the seven groups for w t , and the two new proposed equations for imputation, there are in total 14 possible variations of MI under the Krzanowski system. These variations are denoted "SVDi+PC" and "SVDi+EM", with i =1,...,7. Here, "SVDi+PC" indicates the iterative scheme with parity check using equation (5) and group i for the values of w t , while "SVDi+EM" indicates the iterative scheme with EM imputation using the equation (6) and group i for the values of w t .
These 14 variations were assessed against the iterative scheme with parity check that uses equation (4) for imputation. This latter method was treated as the standard method (or "gold standard"), and was denoted "MIK-Adjusted". An extension of "MIK-Adjusted" was also included by assuming that the exponent a can be chosen randomly, and a~U[0,1]; this extension was called "Unif-MIK-Adjusted". Both "MIK-Adjusted" versions and the variations described above need the value of H (i.e. the number of components of SVD to be retained) to be previously determined. This value was obtained using the cv.SVDImpute function of the imputation package from R, which carries out cross-validation in incomplete matrices (Wong 2013, García-Peña et al. 2014, Arciniegas-Alarcón et al. 2014b).
To evaluate the imputation systems, the methodology proposed by Yan (2013) was used, by taking complete real multienvironmental matrices, randomly deleting observations from them at different percentages, applying the algorithms, and then using appropriate summary statistics to compare the imputations with the real data. Molale et al. (2013) describe some of these performance measures, such as: the Root Mean Square Error (RMSE), the Mean Absolute Error (MAE), the Relative Absolute Error (RAE), and the Root Relative Squared Error (RRSE). These statistics serve to evaluate any imputation system; however, to assess new MI methods, it would be more appropriate to use statistics involving the variability between imputations. For this, Penny and Jolliffe (1999) introduced the T acc statistics, which will be used in this study and which is described below.
Three data sets from G×E trials were considered, which had been previously used by Lavoranti et al. (2007), Yang (2007) and Rad et al. (2013), respectively. The first data set (Rad et al. 2013) consists of a 36×6 matrix for 36 wheat genotypes assessed in six environments under normal and drought stress conditions, at an experimental farm of the University of Putra, Malaysia. The studied variable was the mean plant grain yield (gr). The second data set (Lavoranti et al. 2007) comprises a 20×7 matrix, for 20 Eucalyptus grandis progenies assessed in seven locations in the south and southeast regions of Brazil, and the studied variable was mean tree height (m). Lastly, the third data set (Yang 2007) is a 6×18 matrix, corresponding to six barley genotypes assessed in 18 environments in Alberta, Canada. The studied variable was yield (Mg ha -1 ). Table 1 shows the results from a preliminary study on the choice of the number of multiplicative components (to explain the G×E interaction) of the AMMI model that can be used for each selected data set, on which a simulation study is described below. The method of cross-validation "leave-one-out" by eigenvector (Bro et al. 2008, Gauch 2013 was used to select each model, and the best model was the one which has the lowest PRESS statistics. It can be seen that AMMI3 is an appropriate model for the wheat matrix; AMMI1 is appropriate for the barley matrix; and AMMI2 is appropriate for the eucalyptus matrix. Therefore, these three data sets represent a wide range of data sizes and structures typically found in G×E trials, and thus provide good basis for drawing general conclusions for such trials. Each original data matrix (wheat, eucalyptus and barley) was submitted to random deletion of values at the three rates 10%, 20%, and 35%, since according to Yan (2013), in G×E trials, generally, the number of missing values is lower that 40%. The process was repeated in each data set replicated 1000 times for each percentage of missing values, giving a total of 3000 different matrices with missing values. Altogether, therefore, 9000 incomplete data sets were available, and for each one, the missing values were imputed with the 16 MI algorithms, using computational code in R (R Core Team 2014). The random deletion process for a matrix Y(n×p) was carried out as follows. Random numbers between 0 and 1 were generated in the R software with the runif function. For a fixed r value (0< r <1), if the (pi + j)-th random number was lower than r, then the element in the (i+1, j) position of the matrix was deleted (i =0,1,...,n-1; j =1,...,p). The expected proportion of missing values in the matrix will be r (Krzanowski 1988, Arciniegas-Alarcón et al. 2013). This technique was used with r = 0.1, 0.2 and 0.35.
The imputation accuracy was assessed with the T acc statistics, proposed by Penny and Jolliffe (1999), which has been recently used by Bergamo et al. (2008) and Arciniegas-Alarcón et al. (2014a). T acc is a measure of overall accuracy formed from the sum of the pooled variance between imputation within positions (V E ) and the mean squared bias between the imputations mean and the original value deleted in the simulation study (VQM). Thus: ; in which "na" is the total number of deleted values from the G×E matrix. Each deleted value l has the corresponding position (i, j) in the matrix, that is, in the i-th row and the j-th column. M is the number of imputations for the missing value l; ŷ ij(m) is the m-th imputation for that value according to the proposed methods; Y l is the imputations mean produced for the missing value l; and VO l is the original value l in the complete original data set.
According to Penny and Jolliffe (1999), a good MI method will be that with small values for V E and VQM, which implies low values of T acc . Thus, since "MIK-Adjusted" was treated as the standard method, any method having values for T acc lower than those for "MIK-Adjusted" was considered as a good imputation method. It is worth pointing out that only having a reduced value of V E does not imply good imputation quality, since the method can be biased. Table 2 shows the means and medians of T acc , at the different percentages of values deleted randomly (10, 20 and 35%) for the wheat data set. The best results at each imputation percentage (minimizing the mean and median of T acc ) were obtained by SVD4+EM and SVD5+EM. At 35% imputation, SVD4+PC and SVD5+PC were also better than any of the MIK methods. These results indicate that the weights of w t should be chosen in the interval [0.75;1]. However, the values of T acc in all the remaining 11 imputation systems were higher than for MIK-Adjusted. Therefore, the least recommended algorithms are those with w t in the interval [0;0.45], namely SVDi+PC and SVDi+EM, with i=1,2.

Wheat data
The algorithms that assumed a uniform distribution for a in the case of Unif-MIK-Adjusted; for w t , in the case of SVD7+EM and SVD7+PC deserve special comment. The mean and the median values of Unif-MIK-Adjusted are very close to those of MIK-Adjusted, but never lower, and as the percentage of imputation increases, differences remained the same or even increased. The performances of SVD3+EM and SVD3+PC are better than those of SVD7+EM and SVD7+PC at all the percentages, which means that the insertion of a random component did not improve the imputation efficiency.
One of the initial intentions of the simulation study for wheat data was to determine a set of algorithms that were better than MIK-Ajusted, and this is clearly achieved by SVD4+EM, SVD5+EM, SVD4+PC and SVD5+PC. Thus, the results will be considered consistent if at least one of the four imputation systems minimizes T acc in the other multi-environmental data matrices. Table 3 shows the means and medians of T acc , at the different percentages of values deleted randomly (10, 20 and 35%) for the eucalyptus data set. Similar behaviour was found to that described in the wheat data. For instance, i) Unif-MIK-Adjusted never beats the gold standard MIK-Adjusted, ii) SVD3+EM and SVD3+PC always perform better than SVD7+EM and SVD7+PC, respectively, and iii) the least recommended systems are SVDi+PC and SVDi+EM, with i=1,2, (i.e., the imputations less influenced by SVD due to the chosen weight), since they have the worst performance among the proposed methodologies.

Eucalyptus data
SVD4+PC, SVD5+PC and SVD4+EM beat MIK-Adjusted in all the imputation percentages. It is worth mentioning that the means and medians of T acc for SVD4+PC are lower than those of SVD4+EM, i.e., a method that involves a parity check beats an imputation of type EM -which is a different result from those found in wheat data. While in the wheat matrix SVD5+EM was better than MIK-Adjusted, here the opposite occurs. Table 4 shows the means and medians of T acc , at the different percentages of values deleted randomly (10, 20 and 35%) for the barley data set. In this case, only SVD5+EM presents better results than MIK-Adjusted, while SVD4+EM, SVD4+PC and SVD5+PC exhibit higher values of T acc than for the standard algorithm. The other 11 variations of the Krzanowski method presented the same behaviour of the wheat and eucalyptus data.

DISCUSSION
The main conclusion is that the MI methods which have been derived from the simple imputation system of Krzanowski may have advantages over the current standard method for a wide range of G×E interaction structures encountered in practice. The four imputation systems SVDi+PC and SVDi+EM with i=4,5 showed better results than the standard method in simulations involving the largest matrix (216 observations), having the most complex interaction structure. For smaller matrices and less complex interactions, a subset of these four methods beat the standard. For the matrix with 140 observations and intermediate interaction structure, this subset comprised SVDi+PC and SVDj+EM for i=4,5 and j=4, while for the smallest matrix of 108 observations and simplest interaction structure, it contained just SVD5+EM.
In the light of these results, it is clear that it is only worthwhile considering weights w in the range [0.75, 1.0] (since low values of w mean that the benefits of SVD are being ignored). However, some selection must be initially carried out between the four sets of weights and the methods identified above as the best ones. A fast path in a practical case would be to test the set of four algorithms in order to determine the best one. However, a deeper analysis of the previous results should take into consideration the fact that each method has two parts: one related to the weight w t , and the other related to the iterative scheme used (i.e. iterations with parity check or of type EM). According to T acc , the best result is obtained when w t belongs to group 5 in the three considered matrices, as this always outperforms MIK-Adjusted. On the other hand, a definitive conclusion about the iterative scheme could not be obtained. Thus, in a real situation of incomplete G×E matrices, it is recommended to test both systems SVD5+PC and SVD5+EM.
It is also worth commenting that in the context of cross-validation for components, Bro et al. (2008) pointed out that the Eastment and Krzanowski (EK) method (1982) has a problem of rotational freedom, and the parity check is a source of overfitting. However, in this study, it was found that, in some cases, the parity check within an iterative scheme can offer better predictions than the EM algorithm described by Bro et al. (2008). Thus a hypothesis to investigate in the future is that the influence of rotation problems and overfitting on the predictions might be minimized if EK is used iteratively.
Finally, the researcher may be interested in how to test SVD5+PC and SVD5+EM on a set of incomplete data. It is suggested that, starting with the observed data, some of the entries are randomly deleted (for instance, between 10% and 30%), the two algorithms are applied, and the T acc statistic values are calculated. The deletion procedure is then repeated (for example, 100 times), and the mean and median of all the values are found. The method with the lowest mean or median will be the method to be used on the matrix with missing values. Another option is to choose the method according to the magnitude of the imputation uncertainty, using for the estimation the non-parametric methodology proposed by Heydarbeygie and Ahmadi (2013).