DISTRIBUTION-FREE MULTIPLE IMPUTATION IN AN INTERACTION MATRIX THROUGH SINGULAR VALUE DECOMPOSITION

Some techniques of multivariate statistical analysis can only be conducted on a complete data matrix, but the process of data collection often misses some elements. Imputation is a technique by which the missing elements are replaced by plausible values, so that a valid analysis can be performed on the completed data set. A multiple imputation method is proposed based on a modification to the singular value decomposition (SVD) method for single imputation developed by Krzanowski. The method was evaluated on a genotype × environment (G × E) interaction matrix obtained from a randomized blocks experiment on Eucalyptus grandis grown in multiple environments. Values of E. grandis heights in the complete G × E interaction matrix were deleted randomly at three different rates (5%, 10%, 30%) and were then imputed by the proposed methodology. The results were assessed by means of a general measure of performance (Tacc), and showed a small bias when compared to the original data. However, the bias values were greater than the variability of the imputations about their mean, indicating that the accuracy of the proposed method is lower than its precision. The proposed methodology uses the maximum amount of available information, has no restrictions regarding the pattern or mechanism of the missing values, and is free of assumptions on the data distribution or structure.


INTRODUCTION
Sci. Agric. (Piracicaba, Braz.), v.65, n.4, p.422-427, July/August 2008

Imputation is a technique in which missing elements of a data matrix are replaced by plausible values, so that a valid analysis can be made on the complete data set (observed + imputed). Various imputation methods have been proposed over the years, but current interest is focussed mostly on multiple imputation (MI). MI was first proposed by Rubin (1978), and several other references, including Little & Rubin (1987, 2002), Rubin (1987), Rubin & Schenker (1986), Schafer (1997, 1999), Tanner & Wong (1987) and Zhang (2003), provide excellent descriptions of the technique. The basic idea of the procedure is to replace each missing value by a set of M imputed values drawn from the data distribution, with the variation in these values representing the uncertainty about the true value to be imputed.
The MI procedure involves three distinct steps: (i) Imputation: the missing values are estimated M times, generating M completed data sets; (ii) Analysis: the M completed data sets are analyzed, using statistical procedures appropriate for the problem at hand; and (iii) Combination: the M separate sets of results are combined into one single inference.
The imputation is the most critical step, and the model used in this step is not necessarily the same as the one used in the analysis step. This makes the MI procedure more attractive, as the model best suited for imputing the data is not always the one best suited for the analysis.
On combining the results of the M analyses, the variance of the combined estimate consists of the variance within imputations plus the variance between imputations; in this way the uncertainty in the imputed data is incorporated into the final inference.
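For a scalar quantity estimated from each completed data set, this combination step follows Rubin's rules: the pooled estimate is the mean of the M per-imputation estimates, and its total variance adds the between-imputation variance, inflated by (1 + 1/M), to the average within-imputation variance. A minimal sketch in Python (the function name is ours, for illustration only):

```python
import numpy as np

def rubin_combine(estimates, variances):
    """Combine M per-imputation results via Rubin's rules.

    estimates: length-M sequence of point estimates (one per completed data set)
    variances: length-M sequence of their within-imputation variances
    Returns (pooled estimate, total variance).
    """
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    M = len(estimates)
    q_bar = estimates.mean()            # pooled point estimate
    w_bar = variances.mean()            # average within-imputation variance
    b = estimates.var(ddof=1)           # between-imputation variance
    t = w_bar + (1.0 + 1.0 / M) * b     # total variance of the pooled estimate
    return q_bar, t

# Example with M = 5 imputations of some scalar quantity
q, t = rubin_combine([10.1, 9.8, 10.3, 10.0, 9.9],
                     [0.40, 0.38, 0.41, 0.39, 0.40])
```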
A method is proposed here for the first step of multiple imputation, making no assumptions about the data distribution or structure, by using the singular value decomposition (SVD) of a genotype × environment (G × E) interaction matrix for those models whose analysis needs a complete matrix. The performance of the method is then investigated on randomly deleted entries in the G × E interaction matrix obtained from an experiment on Eucalyptus grandis progenies.

MATERIAL AND METHODS
The data used in this study were obtained from experiments conducted in seven environments in the south and southeast regions of Brazil, on 20 Eucalyptus grandis progenies originating from Australia (km 12, South of Ravenshoe-Mt Pandanus-QLD, 43°41'10" W and 22°45'30" S, at 33 m above sea level, lot 14,420). A randomized block design with six plants per plot and ten replicates was used, at a spacing of 3.0 m × 2.0 m (Lavoranti, 2003).
For a distribution-free approach, the imputed values were obtained by means of a modification to the single imputation method developed by Krzanowski (1988). This method starts from the observation made by Good (1969) that any matrix $Y_{(n \times p)}$ can be decomposed by the singular value decomposition into the form

$$Y = U D V^{T}, \qquad (1)$$

where the columns of $U_{(n \times p)}$ and $V_{(p \times p)}$ are orthonormal and $D = \mathrm{diag}(d_1, \ldots, d_p)$ with $d_1 \ge d_2 \ge \cdots \ge d_p \ge 0$. The $j$th column $v_j = (v_{1j}, \ldots, v_{pj})^{T}$ of $V$ is the eigenvector corresponding to the $j$th largest eigenvalue $d_j^2$ of $Y^{T}Y$, while the $j$th column $u_j = (u_{1j}, \ldots, u_{nj})^{T}$ of $U$ is the eigenvector corresponding to the $j$th largest eigenvalue $d_j^2$ of $YY^{T}$. The decomposition (1) has the elementwise representation

$$y_{ij} = \sum_{h=1}^{p} u_{ih} d_h v_{jh}. \qquad (2)$$

Krzanowski (1987) used this representation as a basis for determining the dimensionality of a multivariate data set. If the data structure is essentially $H$-dimensional, then the variation in the remaining $(p - H)$ dimensions can be treated as random noise, and the main features of the data can be supposed to lie in the space of the first $H$ principal components. The correspondence between the quantities on the right-hand side of (2) and the principal axes of the data configuration therefore suggests the $H$-component model

$$y_{ij} = \sum_{h=1}^{H} u_{ih} d_h v_{jh} + \varepsilon_{ij}. \qquad (3)$$

Supposing that model (3) holds for a specified value of $H$, but the single observation $y_{ij}$ is missing from the data matrix, then $y_{ij}$ can be estimated by

$$\hat{y}_{ij} = \sum_{h=1}^{H} \hat{u}_{ih} \hat{d}_h \hat{v}_{jh}, \qquad (4)$$

in which $\hat{u}_{ih}$, $\hat{d}_h$ and $\hat{v}_{jh}$ must be estimated from the remaining data. The best estimates of these quantities will be those that use the maximal amount of data. Denote by $Y^{(-i)}$ the data matrix obtained on deleting the $i$th row from $Y$, and by $Y^{(-j)}$ the data matrix obtained on deleting the $j$th column from $Y$, and suppose that the singular value decompositions of these matrices are given by

$$Y^{(-i)} = \bar{U}\bar{D}\bar{V}^{T} \qquad (5)$$

and

$$Y^{(-j)} = \tilde{U}\tilde{D}\tilde{V}^{T}. \qquad (6)$$

The maximum-data estimates of $u_{ih}$ and $v_{jh}$ in (4) are clearly $\tilde{u}_{ih}$ and $\bar{v}_{jh}$, respectively, while $d_h$ can be estimated either by $\tilde{d}_h$, by $\bar{d}_h$, or by some combination of the two. Krzanowski (1988) suggested using $(\tilde{d}_h \bar{d}_h)^{1/2}$ as a suitable compromise, so that an estimate of the missing value $y_{ij}$ is given by

$$\hat{y}_{ij} = \sum_{h=1}^{H} \tilde{u}_{ih} (\tilde{d}_h)^{1/2} (\bar{d}_h)^{1/2} \bar{v}_{jh}.$$

Following the maximum-data precept, the highest possible value of $H$ should be used. From
(6) this is evidently $p - 1$, so that the value to be imputed for $y_{ij}$ will be

$$\hat{y}_{ij} = \sum_{h=1}^{p-1} \tilde{u}_{ih} (\tilde{d}_h)^{1/2} (\bar{d}_h)^{1/2} \bar{v}_{jh}. \qquad (7)$$

An iterative numerical process is needed to find the appropriate quantities in (7). Starting with initial "guesses" for the missing values, each iteration requires singular value decompositions of the reduced matrices $Y^{(-i)}$ and $Y^{(-j)}$ for every $(i, j)$ where there is a missing value. The use of (7) provides the updated imputation for that position, and the process continues until convergence (i.e. stability in the successive imputed values). The initial "guesses" for the missing values $y_{ij}$ are most easily provided by the mean of the existing values in the $j$th column. To avoid any possible variation in influence among columns, caused for example by different measurement scales, it is recommended first to standardise $Y$ (completed using the initial "guesses"). The $Y^{(-i)}$ and $Y^{(-j)}$ matrices should also be standardised at each step, so that all operations are conducted on standardised quantities. At the end, the completed matrix $Y$ (i.e. observed + imputed values) should be returned to its original scale: if $\hat{y}^{*}_{ij}$ represents each value of the completed standardised matrix, the $j$th column mean $\bar{y}_j$ and standard deviation $s_j$ are calculated and each value of the completed matrix $Y$ is obtained in its original scale as $y_{ij} = \hat{y}^{*}_{ij} s_j + \bar{y}_j$.

The modification proposed here is a generalisation of the exponents of $\tilde{d}_h$ and $\bar{d}_h$ in (7). Writing $(\tilde{d}_h \bar{d}_h)^{1/2}$ more generally as $(\tilde{d}_h)^{\tilde{a}/b} (\bar{d}_h)^{\hat{a}/b}$ with $\tilde{a} + \hat{a} = b$, the $m$th imputed value becomes

$$\hat{y}^{(m)}_{ij} = \sum_{h=1}^{p-1} \tilde{u}_{ih} (\tilde{d}_h)^{\tilde{a}/b} (\bar{d}_h)^{\hat{a}/b} \bar{v}_{jh}. \qquad (8)$$

In this way, different weights can be assigned to (5) and (6) in the final estimate of $y_{ij}$, by varying the exponents of $\tilde{d}_h$ and $\bar{d}_h$, whereas the form (7) forces them to have equal weights.
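Krzanowski's iterative scheme described above can be sketched in a few lines of Python with NumPy (an illustrative sketch, not the authors' SAS/IML program; it assumes n ≥ p, standardises the completed matrix once per iteration rather than re-standardising each reduced matrix, and does not check the alignment of singular-vector signs between the two reduced SVDs, which a careful implementation should):

```python
import numpy as np

def svd_impute(Y, miss, tol=1e-6, max_iter=50):
    """Iterative SVD imputation in the spirit of Krzanowski (1988), H = p - 1.

    Y    : (n, p) array; values at the missing positions are ignored
    miss : list of (i, j) index pairs of missing cells
    """
    Y = np.asarray(Y, dtype=float).copy()
    n, p = Y.shape
    miss_set = set(miss)
    # Initial "guesses": mean of the observed values in each affected column
    for i, j in miss:
        obs = [Y[r, j] for r in range(n) if (r, j) not in miss_set]
        Y[i, j] = np.mean(obs)
    for _ in range(max_iter):
        mu, sd = Y.mean(axis=0), Y.std(axis=0, ddof=1)
        Z = (Y - mu) / sd                     # standardise the completed matrix
        new = {}
        for i, j in miss:
            # SVD of Z with row i deleted supplies v-bar and d-bar
            _, d_bar, vt_bar = np.linalg.svd(np.delete(Z, i, axis=0),
                                             full_matrices=False)
            # SVD of Z with column j deleted supplies u-tilde and d-tilde
            u_til, d_til, _ = np.linalg.svd(np.delete(Z, j, axis=1),
                                            full_matrices=False)
            z = sum(u_til[i, h] * np.sqrt(d_til[h] * d_bar[h]) * vt_bar[h, j]
                    for h in range(p - 1))    # maximum-data choice H = p - 1
            new[i, j] = z * sd[j] + mu[j]     # back to the original scale
        shift = max(abs(Y[i, j] - new[i, j]) for i, j in miss)
        for (i, j), v in new.items():
            Y[i, j] = v
        if shift < tol:                       # stability of successive imputations
            break
    return Y
```

Observed entries are never modified; only the listed missing cells are updated at each sweep.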
Each different value of $\tilde{a}$, and consequently of $\hat{a}$, generates a new completed matrix $Y$, thus providing a mechanism for generating the M different completed data sets at the first step of the multiple imputation process.
The number of imputations is governed by the number of different exponent pairs used. According to Molenberghs & Verbeke (2005), Rubin (1987) and Schafer (1999), between 3 and 5 imputations should be enough to characterise the variability between imputations. Thus, if 5 changes in the exponents are decided on, between 40% and 60% variation can be produced in the weight given to (5) and (6) by fixing the denominator ($b = 20$, for instance) and taking the values (8, 9, 10, 11, 12) for $\tilde{a}$ and (12, 11, 10, 9, 8) correspondingly for $\hat{a}$. These choices lead to proportions of (40%, 45%, 50%, 55%, 60%) respectively for (5) and (6) in (8). The methodology described above and proposed here uses the greatest quantity of values in $Y$ and does not depend on any distribution for the response variable, being applicable to any matrix of numeric data. The method was implemented in a program developed in the IML module of the SAS statistical system which, after execution, produces a data file with the M = 5 completed data sets, ready to be used in the second step of the MI.
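The exponent pairs and the resulting weights can be enumerated directly (a small illustrative sketch; `weighted_factor` is a hypothetical helper showing how one singular-value factor of the imputation sum changes with the exponents):

```python
# M = 5 imputations from 5 exponent pairs with fixed denominator b = 20.
# Pairing a-tilde in (8, 9, 10, 11, 12) with a-hat = b - a-tilde varies the
# weight of each reduced SVD between 40% and 60%.
b = 20
pairs = [(a_til, b - a_til) for a_til in (8, 9, 10, 11, 12)]
weights = [(a_til / b, a_hat / b) for a_til, a_hat in pairs]

def weighted_factor(d_til, d_bar, a_til, a_hat, b=20):
    """One singular-value factor of the imputation sum with generalised
    exponents; reduces to sqrt(d_til * d_bar) when a_til = a_hat = b / 2."""
    return d_til ** (a_til / b) * d_bar ** (a_hat / b)
```

With `a_til = a_hat = 10` the factor is exactly Krzanowski's compromise; the other four pairs tilt the weight towards one reduced decomposition or the other.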
The procedures were applied to the three reduced data matrices, containing 5%, 10% and 30% missing values. Stability of the imputed values, i.e. their convergence to a single value, was generally achieved within 20 iterations, but, for safety and generality in the program execution, 50 iterations was the default setting. For each percentage of missing values, dispersion graphs were produced for those positions in the data matrix that were randomly deleted at the 5% deletion level. This could be done because the random deletions used the same initial seed at all the different percentage deletion levels.
As a measure of performance of the method at missing value position $l$ (falling in row $i$ and column $j$), the following expression of Penny & Jolliffe (1999) was used:

$$acc_l = \frac{1}{M} \sum_{m=1}^{M} (VI_{lm} - VO_l)^2, \qquad (9)$$

in which $M$ is the number of imputations at that missing value position, $VO_l$ is the original randomly deleted value at that position, and $VI_{lm}$ is the $m$th imputation at that position using (8) according to the proposed method. This expression is computed for $l = 1, 2, \ldots, na$, where $na$ is the total number of missing values. The expression can be separated into two terms,

$$acc_l = \frac{1}{M} \sum_{m=1}^{M} (VI_{lm} - \overline{VI}_l)^2 + (\overline{VI}_l - VO_l)^2,$$

where $\overline{VI}_l = \frac{1}{M} \sum_{m=1}^{M} VI_{lm}$ is the mean value imputed at position $l$, so that the first term represents a variance over the M values in each position and the second a squared bias in the final imputation. Thus the first term is a measure of precision and the second term a measure of accuracy at position $l$.
An overall measure of performance $T_{acc}$ may be computed by averaging the $acc_l$ measures:

$$T_{acc} = \frac{1}{na} \sum_{l=1}^{na} acc_l. \qquad (10)$$

$T_{acc}$ may similarly be broken into two components, $T_{acc} = V_E + VQM$, where

$$V_E = \frac{1}{na} \sum_{l=1}^{na} \frac{1}{M} \sum_{m=1}^{M} (VI_{lm} - \overline{VI}_l)^2 \qquad (11)$$

and

$$VQM = \frac{1}{na} \sum_{l=1}^{na} (\overline{VI}_l - VO_l)^2. \qquad (12)$$

The first component ($V_E$) represents the pooled variance between imputations within positions; the greater its value, the smaller the precision of the multiple imputation method. However, a small value for this component does not necessarily mean that the imputation method is good, for the method may be biased. The second component ($VQM$) represents the average squared bias between the values of $\overline{VI}$ and $VO$; the smaller the bias, the more imputations are similar to the original values and the greater their accuracy. Therefore, the smaller the values of $V_E$ and $VQM$, the better the multiple imputation method.
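The overall measure and its two components can be computed directly from the matrix of imputed values. A minimal Python sketch (the function name is ours, for illustration):

```python
import numpy as np

def performance(VI, VO):
    """Penny & Jolliffe-style performance measures for multiple imputation.

    VI : (na, M) array, the M imputed values at each of na missing positions
    VO : length-na array of the original (deleted) values
    Returns (T_acc, V_E, VQM), where T_acc = V_E + VQM.
    """
    VI = np.asarray(VI, dtype=float)
    VO = np.asarray(VO, dtype=float)
    vi_bar = VI.mean(axis=1)                    # mean imputation per position
    V_E = ((VI - vi_bar[:, None]) ** 2).mean()  # pooled within-position variance
    VQM = ((vi_bar - VO) ** 2).mean()           # average squared bias
    return V_E + VQM, V_E, VQM
```

Because the decomposition is an exact algebraic identity, the returned `T_acc` always equals the sum of the other two components.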

RESULTS AND DISCUSSION
Table 1 gives the results for 5%, 10% and 30% of values deleted at random from the matrix. The variability between imputations within each missing value position is relatively small, as shown by the small coefficients of variation. Furthermore, this variability increases as the percentage of missing data rises from 5% to 10%, except at position (19;2) (Figures 1 and 2). When the percentage of missing values increases from 10% to 30%, the variability increases at positions (19;2), (10;4) and (19;4), and decreases at positions (2;2), (13;4), (5;6) and (4;7) (Figures 2 and 3). The overall precision of the multiple imputation is good for these data (Tables 1-2). As for the accuracy, which reflects the difference between the imputed values and the original values, this shows less consistency than the precision. Values of acc are nearly all larger than their counterparts Var in Table 1, but the pattern is more variable in the other table. So there is no clear relationship between precision and accuracy over the individual missing value positions.
On aggregating over missing value positions, the larger component of the general measure $T_{acc}$ in (10) at all the missing-value percentages is the average squared bias (Table 2). However, as the percentage of missing values rises, the variation between imputations defined by expression (11) increases while the average squared bias defined by expression (12) decreases. This results in an improvement of the overall performance, seen in the decrease of the general measure defined by expression (10).
The proposed method for multiple imputation uses the maximum amount of information available from the original data matrix and is free from assumptions about the distribution or structure of the data. The modification here proposed to this method is a generalisation of the exponents of $\tilde{d}_h$ and $\bar{d}_h$ in (7) when generating the imputations ($m = 1, \ldots, M$) in the first step of the MI: the compromise $(\tilde{d}_h \bar{d}_h)^{1/2}$ is replaced by the fractional powers $(\tilde{d}_h)^{\tilde{a}/b} (\bar{d}_h)^{\hat{a}/b}$ with $\tilde{a} + \hat{a} = b$. Here $na = g \times e \times porc$, with $g$ representing the total number of genotypes, $e$ the total number of environments and $porc$ the percentage of missing data.

Figure 1 -Mean imputation height, standard error and standard deviation with 5% missing values.

Figure 2 -Mean imputation height, standard error and standard deviation with 10% missing values, for the same positions as the data with 5% missing values.

Figure 3 -Mean imputation height, standard error and standard deviation with 30% missing values, for the same positions as the data with 5% missing values.

Table 1 - Imputations for mean heights (m) at each position (ith row; jth column) of a random deletion (5%, 10% and 30% missing).

Table 2 - General performance measure of the proposed multiple imputation method, with 5%, 10% and 30% missing values.